2025-05-07T20:23:26.0473677Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0479361Z Runner name: 'i-050728826a2d12e7e'
2025-05-07T20:23:26.0480277Z Machine name: 'ip-10-0-27-143'
2025-05-07T20:23:26.0482959Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0485209Z Contents: read
2025-05-07T20:23:26.0485727Z Metadata: read
2025-05-07T20:23:26.0486212Z Packages: read
2025-05-07T20:23:26.0486696Z ##[endgroup]
2025-05-07T20:23:26.0488531Z Secret source: None
2025-05-07T20:23:26.0489151Z Prepare workflow directory
2025-05-07T20:23:26.1403938Z Prepare all required actions
2025-05-07T20:23:26.1445625Z Getting action download info
2025-05-07T20:23:26.3842065Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6621605Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.0102283Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.6667594Z Getting action download info
2025-05-07T20:23:28.7699875Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9864348Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.6.3, 12.6.3, gcc)
2025-05-07T20:23:29.0459097Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.0593199Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.0605717Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.0607207Z ##[endgroup]
2025-05-07T20:23:30.0173772Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.0174175Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.0174421Z AMI Name: unknown
2025-05-07T20:23:30.0211216Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.3980825Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.3981139Z with:
2025-05-07T20:23:35.3981391Z   submodules: true
2025-05-07T20:23:35.3981628Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.3982026Z   token: ***
2025-05-07T20:23:35.3982228Z   ssh-strict: true
2025-05-07T20:23:35.3982445Z   ssh-user: git
2025-05-07T20:23:35.3982671Z   persist-credentials: true
2025-05-07T20:23:35.3982928Z   clean: true
2025-05-07T20:23:35.3983164Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.3983430Z   fetch-depth: 1
2025-05-07T20:23:35.3983647Z   fetch-tags: false
2025-05-07T20:23:35.3983864Z   show-progress: true
2025-05-07T20:23:35.3984091Z   lfs: false
2025-05-07T20:23:35.3984300Z   set-safe-directory: true
2025-05-07T20:23:35.3984563Z env:
2025-05-07T20:23:35.3984779Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.3985091Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.3985356Z   BUILD_TARGET: genai
2025-05-07T20:23:35.3985584Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.3985850Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:35.3986099Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.3986337Z ##[endgroup]
2025-05-07T20:23:35.5144827Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.5146010Z ##[group]Getting Git version info
2025-05-07T20:23:35.5146454Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.5147064Z [command]/usr/bin/git version
2025-05-07T20:23:35.5147325Z git version 2.47.1
2025-05-07T20:23:35.5165640Z ##[endgroup]
2025-05-07T20:23:35.5179163Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/3562e8cb-07d7-40de-aedb-7c23eadca378' before making global git config changes
2025-05-07T20:23:35.5180065Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.5192952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.5233747Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.5257178Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.5275909Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.5281541Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.5306758Z refs/heads/main
2025-05-07T20:23:35.5315762Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.3968003Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.4019283Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.4045534Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.4051837Z ##[endgroup]
2025-05-07T20:23:36.4054648Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.4475230Z  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.4562697Z  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.4650213Z  6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.4740711Z  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.4826550Z  f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.4911784Z  420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.4997148Z  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.5010571Z ##[group]Cleaning the repository
2025-05-07T20:23:36.5015799Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.5074428Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.5181013Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.5188029Z ##[endgroup]
2025-05-07T20:23:36.5190110Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.5194792Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.5226330Z ##[endgroup]
2025-05-07T20:23:36.5226726Z ##[group]Setting up auth
2025-05-07T20:23:36.5232416Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.5275512Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.5607798Z Entering 'external/asmjit'
2025-05-07T20:23:36.5674802Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.5746658Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.5813926Z Entering 'external/cutlass'
2025-05-07T20:23:36.5888311Z Entering 'external/googletest'
2025-05-07T20:23:36.5952453Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.6017237Z Entering 'external/json'
2025-05-07T20:23:36.6103114Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:36.6135732Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:36.6467634Z Entering 'external/asmjit'
2025-05-07T20:23:36.6531995Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.6604255Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.6668576Z Entering 'external/cutlass'
2025-05-07T20:23:36.6743696Z Entering 'external/googletest'
2025-05-07T20:23:36.6812035Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.6875658Z Entering 'external/json'
2025-05-07T20:23:36.6963727Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.7014623Z ##[endgroup]
2025-05-07T20:23:36.7015052Z ##[group]Fetching the repository
2025-05-07T20:23:36.7021960Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:36.9346192Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:36.9346693Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:36.9372855Z ##[endgroup]
2025-05-07T20:23:36.9373349Z ##[group]Determining the checkout info
2025-05-07T20:23:36.9374518Z ##[endgroup]
2025-05-07T20:23:36.9379208Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:36.9431507Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:36.9460139Z ##[group]Checking out the ref
2025-05-07T20:23:36.9464299Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:36.9585733Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9589074Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:36.9598413Z ##[endgroup]
2025-05-07T20:23:36.9598819Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:36.9604124Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.9651775Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:36.9682474Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:36.9713482Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:36.9741845Z ##[endgroup]
2025-05-07T20:23:36.9742215Z ##[group]Fetching submodules
2025-05-07T20:23:36.9744981Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.0119346Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.0119815Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.0120243Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.0120616Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.0121303Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.0121712Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.0122112Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.0136214Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.0563707Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.0714627Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.0817950Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.0987344Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.1077753Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.1158493Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.1259796Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.1277242Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.1610184Z Entering 'external/asmjit'
2025-05-07T20:23:37.1642109Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.1674353Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.1706473Z Entering 'external/cutlass'
2025-05-07T20:23:37.1738616Z Entering 'external/googletest'
2025-05-07T20:23:37.1772060Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1804193Z Entering 'external/json'
2025-05-07T20:23:37.1850415Z ##[endgroup]
2025-05-07T20:23:37.1850818Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.1857394Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.2188087Z Entering 'external/asmjit'
2025-05-07T20:23:37.2231187Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2231867Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2275257Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.2317789Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2318249Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2368649Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.2410082Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2410513Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2457314Z Entering 'external/cutlass'
2025-05-07T20:23:37.2502545Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2502990Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2554641Z Entering 'external/googletest'
2025-05-07T20:23:37.2599392Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2599829Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2643652Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.2688611Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2689057Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2730874Z Entering 'external/json'
2025-05-07T20:23:37.2777939Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2778387Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2838096Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.3171432Z Entering 'external/asmjit'
2025-05-07T20:23:37.3232707Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.3235233Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.3296643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.3299516Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.3365377Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.3366496Z Entering 'external/cutlass'
2025-05-07T20:23:37.3424169Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.3426834Z Entering 'external/googletest'
2025-05-07T20:23:37.3487691Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.3491443Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.3550257Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.3552976Z Entering 'external/json'
2025-05-07T20:23:37.3614946Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.3738953Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.4071367Z Entering 'external/asmjit'
2025-05-07T20:23:37.4103394Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4135821Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4170342Z Entering 'external/cutlass'
2025-05-07T20:23:37.4201676Z Entering 'external/googletest'
2025-05-07T20:23:37.4234650Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4268866Z Entering 'external/json'
2025-05-07T20:23:37.4317023Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.4647914Z Entering 'external/asmjit'
2025-05-07T20:23:37.4681838Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4714549Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4746342Z Entering 'external/cutlass'
2025-05-07T20:23:37.4778727Z Entering 'external/googletest'
2025-05-07T20:23:37.4810902Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4843179Z Entering 'external/json'
2025-05-07T20:23:37.4888829Z ##[endgroup]
2025-05-07T20:23:37.4929165Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.4955711Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.5143510Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.5143835Z with:
2025-05-07T20:23:37.5144072Z   name: fbgemm_genai_x86_gcc_py3.12_cu12.6.3.whl
2025-05-07T20:23:37.5144382Z   merge-multiple: false
2025-05-07T20:23:37.5144639Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.5144890Z   run-id: 14891846252
2025-05-07T20:23:37.5145106Z env:
2025-05-07T20:23:37.5145326Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.5145616Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.5145862Z   BUILD_TARGET: genai
2025-05-07T20:23:37.5146078Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.5146309Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:37.5146558Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.5146792Z ##[endgroup]
2025-05-07T20:23:37.7509955Z Downloading single artifact
2025-05-07T20:23:37.9190612Z Preparing to download the following artifacts:
2025-05-07T20:23:37.9191431Z - fbgemm_genai_x86_gcc_py3.12_cu12.6.3.whl (ID: 3081362852, Size: 12511372, Expected Digest: sha256:fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69)
2025-05-07T20:23:37.9729964Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ec3a2fd8-75ec-5d2c-a37b-ee6ee19c88ae/artifacts/768a04041691747daab1e752da2c135b903b31da5ee0699a6f825976517e0bc8.zip
2025-05-07T20:23:37.9731358Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.0567044Z (node:68266) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.0567968Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.2734282Z SHA256 digest of downloaded artifact is fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69
2025-05-07T20:23:38.2734947Z Artifact download completed successfully.
2025-05-07T20:23:38.2735329Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.2741342Z Download artifact has finished successfully
2025-05-07T20:23:38.2984994Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.2985383Z with:
2025-05-07T20:23:38.2985597Z   driver-version: 570.133.07
2025-05-07T20:23:38.2985836Z env:
2025-05-07T20:23:38.2986057Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.2986360Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.2986598Z   BUILD_TARGET: genai
2025-05-07T20:23:38.2986830Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.2987067Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.2987325Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.2987552Z ##[endgroup]
2025-05-07T20:23:38.3084044Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.3084426Z with:
2025-05-07T20:23:38.3084630Z   timeout_minutes: 10
2025-05-07T20:23:38.3084866Z   max_attempts: 3
2025-05-07T20:23:38.3107789Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU
          # if there is more than one of them, so that the same driver version is not
          # printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |  4184MiB / 23028MiB  |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.3130826Z   retry_wait_seconds: 10
2025-05-07T20:23:38.3131088Z   polling_interval_seconds: 1
2025-05-07T20:23:38.3131348Z   warning_on_retry: true
2025-05-07T20:23:38.3131595Z   continue_on_error: false
2025-05-07T20:23:38.3132070Z env:
2025-05-07T20:23:38.3132347Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.3132757Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.3133192Z   BUILD_TARGET: genai
2025-05-07T20:23:38.3133502Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.3150434Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.3150708Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.3150950Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.3151189Z ##[endgroup]
2025-05-07T20:23:38.3850324Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.3851809Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.3853859Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:38.7325488Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:38.7326209Z No packages marked for removal.
2025-05-07T20:23:38.7388758Z Dependencies resolved.
2025-05-07T20:23:38.7398427Z Nothing to do.
2025-05-07T20:23:38.7398914Z Complete!
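One detail worth calling out from the download-artifact step above: the artifact is verified against an expected SHA-256 digest before use. A minimal standalone sketch of the same check, assuming the artifact archive has been saved locally (the file name and variable names here are illustrative, not part of the workflow):

  # EXPECTED is the "Expected Digest" value printed by download-artifact.
  EXPECTED="fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69"
  ACTUAL="$(sha256sum artifact.zip | awk '{print $1}')"   # hypothetical local file
  if [ "$ACTUAL" = "$EXPECTED" ]; then
      echo "Artifact digest verified"
  else
      echo "Digest mismatch: got $ACTUAL, expected $EXPECTED" >&2
      exit 1
  fi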
2025-05-07T20:23:38.7745432Z + install_nvidia_driver_common
2025-05-07T20:23:38.7749454Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:38.7749748Z + lspci
2025-05-07T20:23:38.7751508Z Before installing NVIDIA driver
2025-05-07T20:23:38.7875067Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:38.7876001Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:38.7876543Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:38.7877048Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:38.7877515Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:38.7878027Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:38.7878489Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:38.7878958Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
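The A10G is visible at PCI address 00:1e.0 above. The script's GPU-reset path enumerates NVIDIA devices the same way; distilled from the script:

  # List NVIDIA PCI devices; -D prints the domain (0000:00:1e.0 ...),
  # which is the form needed for /sys/bus/pci/devices/<id> in the reset loop.
  lspci -D | grep -i NVIDIA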
2025-05-07T20:23:38.7879397Z + lsmod
2025-05-07T20:23:38.7924245Z Module                         Size  Used by
2025-05-07T20:23:38.7924558Z xt_nat                        16384  0
2025-05-07T20:23:38.7924832Z nvidia_modeset              1716224  0
2025-05-07T20:23:38.7925189Z video                         65536  1 nvidia_modeset
2025-05-07T20:23:38.7925556Z wmi                           36864  1 video
2025-05-07T20:23:38.7925830Z nvidia_uvm                  1884160  0
2025-05-07T20:23:38.7926130Z nvidia                     11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:38.7926451Z drm                          602112  1 nvidia
2025-05-07T20:23:38.7926758Z drm_panel_orientation_quirks  32768  1 drm
2025-05-07T20:23:38.7927119Z backlight                     24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:38.7927458Z i2c_core                     110592  2 nvidia,drm
2025-05-07T20:23:38.7927742Z veth                          36864  0
2025-05-07T20:23:38.7927997Z xt_conntrack                  16384  1
2025-05-07T20:23:38.7928249Z nft_chain_nat                 16384  3
2025-05-07T20:23:38.7928533Z xt_MASQUERADE                 20480  1
2025-05-07T20:23:38.7928845Z nf_nat                        57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:38.7929186Z nf_conntrack_netlink          57344  0
2025-05-07T20:23:38.7930053Z nf_conntrack                 184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:38.7930506Z nf_defrag_ipv6                24576  1 nf_conntrack
2025-05-07T20:23:38.7930819Z nf_defrag_ipv4                16384  1 nf_conntrack
2025-05-07T20:23:38.7931120Z xfrm_user                     57344  1
2025-05-07T20:23:38.7931378Z xfrm_algo                     16384  1 xfrm_user
2025-05-07T20:23:38.7931665Z xt_addrtype                   16384  2
2025-05-07T20:23:38.7931920Z nft_compat                    20480  4
2025-05-07T20:23:38.7932218Z nf_tables                    311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:38.7932619Z nfnetlink                     20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:38.7932984Z br_netfilter                  36864  0
2025-05-07T20:23:38.7933340Z bridge                       323584  1 br_netfilter
2025-05-07T20:23:38.7933630Z stp                           16384  1 bridge
2025-05-07T20:23:38.7933912Z llc                           16384  2 bridge,stp
2025-05-07T20:23:38.7934194Z overlay                      167936  0
2025-05-07T20:23:38.7934444Z tls                          135168  0
2025-05-07T20:23:38.7934695Z nls_ascii                     16384  1
2025-05-07T20:23:38.7934945Z nls_cp437                     20480  1
2025-05-07T20:23:38.7935186Z vfat                          24576  1
2025-05-07T20:23:38.7935433Z fat                           86016  1 vfat
2025-05-07T20:23:38.7935699Z sunrpc                       696320  1
2025-05-07T20:23:38.7935936Z i8042                         45056  0
2025-05-07T20:23:38.7936176Z ena                          180224  0
2025-05-07T20:23:38.7936426Z serio                         28672  3 i8042
2025-05-07T20:23:38.7936695Z ghash_clmulni_intel           16384  0
2025-05-07T20:23:38.7936955Z button                        24576  0
2025-05-07T20:23:38.7937207Z sch_fq_codel                  20480  17
2025-05-07T20:23:38.7937454Z dm_mod                       188416  0
2025-05-07T20:23:38.7937705Z fuse                         163840  1
2025-05-07T20:23:38.7937949Z loop                          36864  0
2025-05-07T20:23:38.7938191Z configfs                      57344  1
2025-05-07T20:23:38.7938448Z dax                           45056  1 dm_mod
2025-05-07T20:23:38.7938725Z dmi_sysfs                     20480  0
2025-05-07T20:23:38.7939124Z crc32_pclmul                  16384  0
2025-05-07T20:23:38.7939378Z crc32c_intel                  24576  0
2025-05-07T20:23:38.7939630Z efivarfs                      24576  1
2025-05-07T20:23:38.7939915Z + modinfo nvidia
2025-05-07T20:23:38.7942891Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:38.7943499Z import_ns:      DMA_BUF
2025-05-07T20:23:38.7943759Z alias:          char-major-195-*
2025-05-07T20:23:38.7944029Z version:        570.133.07
2025-05-07T20:23:38.7944278Z supported:      external
2025-05-07T20:23:38.7944522Z license:        Dual MIT/GPL
2025-05-07T20:23:38.7944815Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:38.7945215Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:38.7945586Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:38.7945909Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:38.7946261Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:38.7946599Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:38.7946918Z depends:        i2c-core,drm
2025-05-07T20:23:38.7947202Z retpoline:      Y
2025-05-07T20:23:38.7947424Z name:           nvidia
2025-05-07T20:23:38.7947787Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:38.7948258Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:38.7948802Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:38.7949267Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:38.7949575Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:38.7949879Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:38.7950187Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:38.7950491Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:38.7950925Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:38.7951291Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:38.7951778Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:38.7952116Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:38.7952426Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:38.7952728Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:38.7953093Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:38.7953490Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:38.7953866Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:38.7954280Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:38.7954689Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:38.7955186Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:38.7955601Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:38.7955944Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:38.7956312Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:38.7956677Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:38.7957009Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:38.7957331Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:38.7957654Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:38.7957977Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:38.7958288Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:38.7958629Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:38.7958990Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:38.7959555Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:38.7959882Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:38.7960221Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:38.7960558Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:38.7960908Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:38.7961391Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:38.7961680Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:38.7962001Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:38.7962319Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:38.7962633Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:38.7962961Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:38.7963305Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:38.7963651Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:38.7963977Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:38.7964327Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:38.7964658Z parm:           rm_firmware_active:charp
2025-05-07T20:23:38.7964953Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:38.7965195Z ++ command -v nvidia-smi
2025-05-07T20:23:38.7965452Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:38.7965706Z + set +e
2025-05-07T20:23:38.7966021Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:38.8184049Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:38.8184443Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:38.8184765Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:38.8185188Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:38.8185510Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:38.8185939Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:38.8186403Z + set -e
2025-05-07T20:23:38.8186603Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:38.8187004Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
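The decision just logged — skip installation because the installed driver matches — distills to a few lines; DRIVER_VERSION is the action input (570.133.07 here):

  # Skip reinstallation when the requested driver is already active.
  if command -v nvidia-smi >/dev/null 2>&1; then
      INSTALLED="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)"
      if [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED) already installed; skipping installation"
      fi
  fi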
2025-05-07T20:23:38.8187461Z + post_install_nvidia_driver_common
2025-05-07T20:23:38.8190890Z + sudo modprobe nvidia
2025-05-07T20:23:39.0061911Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:39.0062756Z + lspci
2025-05-07T20:23:39.0063008Z After installing NVIDIA driver
2025-05-07T20:23:39.0180567Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.0181059Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.0181599Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.0182099Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.0182570Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.0183210Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.0183937Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.0184403Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.0184798Z + lsmod
2025-05-07T20:23:39.0218563Z Module                         Size  Used by
2025-05-07T20:23:39.0218898Z xt_nat                        16384  0
2025-05-07T20:23:39.0219196Z nvidia_modeset              1716224  0
2025-05-07T20:23:39.0219597Z video                         65536  1 nvidia_modeset
2025-05-07T20:23:39.0219948Z wmi                           36864  1 video
2025-05-07T20:23:39.0220217Z nvidia_uvm                  1884160  0
2025-05-07T20:23:39.0220526Z nvidia                     11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:39.0220853Z drm                          602112  1 nvidia
2025-05-07T20:23:39.0221152Z drm_panel_orientation_quirks  32768  1 drm
2025-05-07T20:23:39.0221540Z backlight                     24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:39.0221880Z i2c_core                     110592  2 nvidia,drm
2025-05-07T20:23:39.0222163Z veth                          36864  0
2025-05-07T20:23:39.0222409Z xt_conntrack                  16384  1
2025-05-07T20:23:39.0222665Z nft_chain_nat                 16384  3
2025-05-07T20:23:39.0222918Z xt_MASQUERADE                 20480  1
2025-05-07T20:23:39.0223220Z nf_nat                        57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.0223566Z nf_conntrack_netlink          57344  0
2025-05-07T20:23:39.0224217Z nf_conntrack                 184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.0224665Z nf_defrag_ipv6                24576  1 nf_conntrack
2025-05-07T20:23:39.0224973Z nf_defrag_ipv4                16384  1 nf_conntrack
2025-05-07T20:23:39.0225263Z xfrm_user                     57344  1
2025-05-07T20:23:39.0225524Z xfrm_algo                     16384  1 xfrm_user
2025-05-07T20:23:39.0225811Z xt_addrtype                   16384  2
2025-05-07T20:23:39.0226066Z nft_compat                    20480  4
2025-05-07T20:23:39.0226365Z nf_tables                    311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.0226761Z nfnetlink                     20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.0227126Z br_netfilter                  36864  0
2025-05-07T20:23:39.0227401Z bridge                       323584  1 br_netfilter
2025-05-07T20:23:39.0227681Z stp                           16384  1 bridge
2025-05-07T20:23:39.0227961Z llc                           16384  2 bridge,stp
2025-05-07T20:23:39.0228241Z overlay                      167936  0
2025-05-07T20:23:39.0228487Z tls                          135168  0
2025-05-07T20:23:39.0228784Z nls_ascii                     16384  1
2025-05-07T20:23:39.0229034Z nls_cp437                     20480  1
2025-05-07T20:23:39.0229273Z vfat                          24576  1
2025-05-07T20:23:39.0229521Z fat                           86016  1 vfat
2025-05-07T20:23:39.0229782Z sunrpc                       696320  1
2025-05-07T20:23:39.0230023Z i8042                         45056  0
2025-05-07T20:23:39.0230257Z ena                          180224  0
2025-05-07T20:23:39.0230506Z serio                         28672  3 i8042
2025-05-07T20:23:39.0230783Z ghash_clmulni_intel           16384  0
2025-05-07T20:23:39.0231033Z button                        24576  0
2025-05-07T20:23:39.0231288Z sch_fq_codel                  20480  17
2025-05-07T20:23:39.0231545Z dm_mod                       188416  0
2025-05-07T20:23:39.0231785Z fuse                         163840  1
2025-05-07T20:23:39.0232031Z loop                          36864  0
2025-05-07T20:23:39.0232435Z configfs                      57344  1
2025-05-07T20:23:39.0232684Z dax                           45056  1 dm_mod
2025-05-07T20:23:39.0232964Z dmi_sysfs                     20480  0
2025-05-07T20:23:39.0233218Z crc32_pclmul                  16384  0
2025-05-07T20:23:39.0233463Z crc32c_intel                  24576  0
2025-05-07T20:23:39.0233711Z efivarfs                      24576  1
2025-05-07T20:23:39.0233958Z + modinfo nvidia
2025-05-07T20:23:39.0236836Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.0237430Z import_ns:      DMA_BUF
2025-05-07T20:23:39.0237704Z alias:          char-major-195-*
2025-05-07T20:23:39.0237970Z version:        570.133.07
2025-05-07T20:23:39.0238205Z supported:      external
2025-05-07T20:23:39.0238455Z license:        Dual MIT/GPL
2025-05-07T20:23:39.0238736Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.0239068Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.0239373Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:39.0239691Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.0240023Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.0240346Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.0240651Z depends:        i2c-core,drm
2025-05-07T20:23:39.0240906Z retpoline:      Y
2025-05-07T20:23:39.0241113Z name:           nvidia
2025-05-07T20:23:39.0241462Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.0241920Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.0242353Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.0242750Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.0243053Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:39.0243347Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.0243649Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:39.0243945Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:39.0244247Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:39.0244714Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.0245099Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.0245420Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.0245714Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:39.0246011Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.0246366Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.0246757Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.0247124Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.0247528Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0247926Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.0248329Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0248741Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.0249080Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.0249433Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.0249794Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.0250123Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.0250484Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.0250803Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.0251122Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.0251425Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:39.0251758Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.0252116Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.0252437Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:39.0252757Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.0253098Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.0253622Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:39.0253963Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.0254280Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:39.0254563Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.0254876Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.0255185Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.0255496Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.0255817Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.0256191Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.0256528Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:39.0256843Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.0257170Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.0257498Z parm:           rm_firmware_active:charp
2025-05-07T20:23:39.0257779Z + set +e
2025-05-07T20:23:39.0257978Z + nvidia-smi
2025-05-07T20:23:39.0417266Z Wed May  7 20:23:39 2025
2025-05-07T20:23:39.0417647Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.0418303Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:39.0418859Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:39.0419340Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:39.0419855Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:39.0420281Z |                                         |                        |               MIG M. |
2025-05-07T20:23:39.0420607Z |=========================================+========================+======================|
2025-05-07T20:23:39.0555349Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:39.0556076Z |  0%   25C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:39.0556522Z |                                         |                        |                  N/A |
2025-05-07T20:23:39.0556911Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:39.0560390Z
2025-05-07T20:23:39.0560877Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.0561386Z | Processes:                                                                              |
2025-05-07T20:23:39.0561816Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:39.0562242Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:39.0562601Z |=========================================================================================|
2025-05-07T20:23:39.0566837Z |  No running processes found                                                             |
2025-05-07T20:23:39.0567441Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.3226346Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:39.3388909Z NVIDIA A10G
2025-05-07T20:23:39.3430952Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:39.3432550Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:39.3432885Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:39.3433287Z + set -e
2025-05-07T20:23:39.3433571Z INFO: Ignoring allowed status 0
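The job uses nvidia-smi's CSV query interface twice above (driver_version, then gpu_name, the field that goes missing when the driver has crashed). The same interface generalizes to other fields; a sketch, with standard query property names and output values taken from this runner:

  # Machine-readable GPU facts; add or remove fields as needed.
  nvidia-smi --query-gpu=name,driver_version,memory.total,temperature.gpu \
             --format=csv,noheader --id=0
  # => NVIDIA A10G, 570.133.07, 23028 MiB, 25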
2025-05-07T20:23:39.3440758Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:39.3444164Z + sudo yum install -y yum-utils
2025-05-07T20:23:39.8106041Z Last metadata expiration check: 0:09:36 ago on Wed May  7 20:14:03 2025.
2025-05-07T20:23:39.8352298Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:39.8758444Z Dependencies resolved.
2025-05-07T20:23:39.8931931Z Nothing to do.
2025-05-07T20:23:39.8932273Z Complete!
2025-05-07T20:23:39.9322933Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:39.9323550Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:39.9324392Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:40.2654948Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:40.3239177Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:40.8277145Z nvidia-container-toolkit                         14 kB/s | 833 B     00:00
2025-05-07T20:23:40.8523894Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:40.8529598Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:23:40.8919685Z Dependencies resolved.
2025-05-07T20:23:40.9100779Z Nothing to do.
2025-05-07T20:23:40.9101591Z Complete!
2025-05-07T20:23:40.9485332Z + sudo systemctl restart docker
2025-05-07T20:23:43.4732466Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:43.4931186Z Wed May  7 20:23:43 2025
2025-05-07T20:23:43.4931678Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.4932189Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:43.4932677Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:43.4933260Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:43.4933793Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:43.4934248Z |                                         |                        |               MIG M. |
2025-05-07T20:23:43.4934964Z |=========================================+========================+======================|
2025-05-07T20:23:43.5062885Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:43.5063455Z |  0%   26C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:43.5063835Z |                                         |                        |                  N/A |
2025-05-07T20:23:43.5064229Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:43.5067881Z
2025-05-07T20:23:43.5068394Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.5068971Z | Processes:                                                                              |
2025-05-07T20:23:43.5069461Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:43.5069859Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:43.5070207Z |=========================================================================================|
2025-05-07T20:23:43.5073173Z |  No running processes found                                                             |
2025-05-07T20:23:43.5073815Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.3666886Z Command completed after 1 attempt(s).
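The step that just completed exported GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all for later steps. A sketch of how a container step can consume it — the image tag is illustrative, and GPU_FLAG is deliberately left unquoted so it splits into separate docker arguments:

  # Smoke-test the container toolkit with the exported flags.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi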
2025-05-07T20:23:44.3751344Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:44.3751804Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:44.3766017Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:44.3766362Z env:
2025-05-07T20:23:44.3766769Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:44.3767060Z   BUILD_ENV: build_binary
2025-05-07T20:23:44.3767318Z   BUILD_TARGET: genai
2025-05-07T20:23:44.3767554Z   BUILD_VARIANT: cuda
2025-05-07T20:23:44.3767785Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:44.3768046Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:44.3768351Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:44.3768678Z ##[endgroup]
2025-05-07T20:23:44.7163384Z ################################################################################
2025-05-07T20:23:44.7163761Z # Print System Info
2025-05-07T20:23:44.7163978Z #
2025-05-07T20:23:44.7180192Z # [2025-05-07T20:23:44.717Z] + print_system_info
2025-05-07T20:23:44.7180616Z ################################################################################
2025-05-07T20:23:44.7180889Z
2025-05-07T20:23:44.7181014Z ################################################################################
2025-05-07T20:23:44.7181356Z [INFO] Printing environment variables ...
2025-05-07T20:23:44.7181675Z + printenv
2025-05-07T20:23:44.7181789Z
2025-05-07T20:23:44.7218109Z SHELL=/bin/bash
2025-05-07T20:23:44.7218719Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:44.7219258Z BUILD_VARIANT=cuda
2025-05-07T20:23:44.7219965Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7220682Z GITHUB_ACTION=__run
2025-05-07T20:23:44.7220960Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:44.7221300Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:44.7221549Z RUNNER_NAME=i-050728826a2d12e7e
2025-05-07T20:23:44.7221870Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:44.7222173Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:44.7222438Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:44.7222813Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:44.7223225Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:44.7223504Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:44.7223802Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:44.7224258Z ***
2025-05-07T20:23:44.7224482Z LOGNAME=ec2-user
2025-05-07T20:23:44.7224719Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:44.7224973Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:44.7225194Z GITHUB_ACTIONS=true
2025-05-07T20:23:44.7225415Z SYSTEMD_EXEC_PID=55527
2025-05-07T20:23:44.7225690Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:44.7226222Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:44.7226724Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:44.7226998Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:44.7227248Z RUNNER_OS=Linux
2025-05-07T20:23:44.7227469Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:44.7227710Z HOME=/home/ec2-user
2025-05-07T20:23:44.7227962Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:44.7228240Z LANG=C.UTF-8
2025-05-07T20:23:44.7228532Z RUNNER_TRACKING_ID=github_1c494cfc-f0a5-47f9-8949-f21ca7f48e65
2025-05-07T20:23:44.7228887Z RUNNER_ARCH=X64
2025-05-07T20:23:44.7229151Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:44.7229470Z BUILD_TARGET=genai
2025-05-07T20:23:44.7229979Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7230802Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7231512Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:44.7232406Z INVOCATION_ID=2d655c04c2b34aecaea14cccbfda1e33
2025-05-07T20:23:44.7232735Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:44.7232993Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:44.7233551Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7234303Z BUILD_ENV=build_binary
2025-05-07T20:23:44.7234527Z GITHUB_ACTOR=q10
2025-05-07T20:23:44.7234743Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:44.7234962Z KERN_NAME_LC=linux
2025-05-07T20:23:44.7235181Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:44.7235478Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:44.7235812Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:44.7236050Z USER=ec2-user
2025-05-07T20:23:44.7236283Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:44.7236569Z SHLVL=1
2025-05-07T20:23:44.7236767Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:44.7237067Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:44.7237500Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:44.7237853Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:44.7238088Z KERN_NAME=Linux
2025-05-07T20:23:44.7238321Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:44.7238726Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:44.7239139Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:44.7239413Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:44.7239652Z JOURNAL_STREAM=8:85602
2025-05-07T20:23:44.7239953Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:44.7240314Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:44.7240621Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:44.7240946Z GITHUB_BASE_REF=main
2025-05-07T20:23:44.7241162Z CI=true
2025-05-07T20:23:44.7241372Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:44.7241651Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:44.7241918Z GITHUB_ACTION_REF=
2025-05-07T20:23:44.7242163Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:44.7242745Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7243297Z MACHINE_NAME=x86_64
2025-05-07T20:23:44.7243516Z _=/usr/bin/printenv
2025-05-07T20:23:44.7243647Z
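GPU_FLAG shows up in the environment above because the setup-nvidia script appended it to the file named by $GITHUB_ENV (also visible in the dump); anything written there as KEY=VALUE becomes an environment variable for subsequent steps. The mechanism, exactly as the script uses it:

  # Export a variable to all later steps of the job by appending
  # KEY=VALUE to the file that $GITHUB_ENV points at.
  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"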
2025-05-07T20:23:44.7243777Z ################################################################################
2025-05-07T20:23:44.7244086Z [INFO] Print ldd version ...
2025-05-07T20:23:44.7244332Z + ldd --version
2025-05-07T20:23:44.7244471Z
2025-05-07T20:23:44.7244572Z ldd (GNU libc) 2.34
2025-05-07T20:23:44.7244838Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:44.7245270Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:44.7245791Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:44.7246228Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:44.7246441Z
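The glibc version printed by ldd is what bounds which manylinux wheels the runner can consume (PEP 600 ties the tag to the glibc version). A quick compatibility check — a sketch, with the required version here purely illustrative:

  # A manylinux_2_34 wheel needs glibc >= 2.34.
  getconf GNU_LIBC_VERSION     # => glibc 2.34
  ldd --version | head -n 1    # same information, as printed above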
2025-05-07T20:23:44.7247111Z + nproc 2025-05-07T20:23:44.7247221Z 2025-05-07T20:23:44.7265901Z 16 2025-05-07T20:23:44.7267584Z 2025-05-07T20:23:44.7267721Z + lscpu 2025-05-07T20:23:44.7399242Z 2025-05-07T20:23:44.7399417Z Architecture: x86_64 2025-05-07T20:23:44.7399810Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:44.7400291Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7400692Z Byte Order: Little Endian 2025-05-07T20:23:44.7400999Z CPU(s): 16 2025-05-07T20:23:44.7401296Z On-line CPU(s) list: 0-15 2025-05-07T20:23:44.7401612Z Vendor ID: AuthenticAMD 2025-05-07T20:23:44.7401940Z Model name: AMD EPYC 7R32 2025-05-07T20:23:44.7402254Z CPU family: 23 2025-05-07T20:23:44.7404416Z Model: 49 2025-05-07T20:23:44.7404713Z Thread(s) per core: 2 2025-05-07T20:23:44.7405001Z Core(s) per socket: 8 2025-05-07T20:23:44.7405288Z Socket(s): 1 2025-05-07T20:23:44.7405681Z Stepping: 0 2025-05-07T20:23:44.7405977Z BogoMIPS: 5599.99 2025-05-07T20:23:44.7408022Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7410041Z Hypervisor vendor: KVM 2025-05-07T20:23:44.7410353Z Virtualization type: full 2025-05-07T20:23:44.7410692Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:44.7411046Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:44.7411409Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:44.7411760Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:44.7412073Z NUMA node(s): 1 2025-05-07T20:23:44.7412364Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:44.7412698Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:44.7413061Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:44.7413549Z Vulnerability L1tf: Not affected 2025-05-07T20:23:44.7414046Z Vulnerability Mds: Not affected 2025-05-07T20:23:44.7414550Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:44.7415052Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:44.7415574Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:44.7416336Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:44.7417117Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:44.7417746Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:44.7418412Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:44.7419246Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:44.7419902Z Vulnerability Srbds: Not affected 2025-05-07T20:23:44.7420258Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:44.7420572Z 2025-05-07T20:23:44.7420663Z + cat /proc/cpuinfo 2025-05-07T20:23:44.7420796Z 2025-05-07T20:23:44.7420886Z processor : 0 2025-05-07T20:23:44.7421101Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7421371Z cpu family : 23 2025-05-07T20:23:44.7421599Z model : 49 
2025-05-07T20:23:44.7421807Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7422049Z stepping : 0 2025-05-07T20:23:44.7422262Z microcode : 0x830107f 2025-05-07T20:23:44.7422483Z cpu MHz : 2077.110 2025-05-07T20:23:44.7422698Z cache size : 512 KB 2025-05-07T20:23:44.7422914Z physical id : 0 2025-05-07T20:23:44.7423119Z siblings : 16 2025-05-07T20:23:44.7423319Z core id : 0 2025-05-07T20:23:44.7423523Z cpu cores : 8 2025-05-07T20:23:44.7423720Z apicid : 0 2025-05-07T20:23:44.7423922Z initial apicid : 0 2025-05-07T20:23:44.7424144Z fpu : yes 2025-05-07T20:23:44.7424338Z fpu_exception : yes 2025-05-07T20:23:44.7424556Z cpuid level : 13 2025-05-07T20:23:44.7424761Z wp : yes 2025-05-07T20:23:44.7426812Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7429128Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7429605Z bogomips : 5599.99 2025-05-07T20:23:44.7429828Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7430070Z clflush size : 64 2025-05-07T20:23:44.7430284Z cache_alignment : 64 2025-05-07T20:23:44.7430566Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7430887Z power management: 2025-05-07T20:23:44.7431018Z 2025-05-07T20:23:44.7431103Z processor : 1 2025-05-07T20:23:44.7431321Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7431561Z cpu family : 23 2025-05-07T20:23:44.7431781Z model : 49 2025-05-07T20:23:44.7432026Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7432270Z stepping : 0 2025-05-07T20:23:44.7432473Z microcode : 0x830107f 2025-05-07T20:23:44.7432704Z cpu MHz : 2921.142 2025-05-07T20:23:44.7432919Z cache size : 512 KB 2025-05-07T20:23:44.7433131Z physical id : 0 2025-05-07T20:23:44.7433331Z siblings : 16 2025-05-07T20:23:44.7433535Z core id : 1 2025-05-07T20:23:44.7433736Z cpu cores : 8 2025-05-07T20:23:44.7433929Z apicid : 2 2025-05-07T20:23:44.7434126Z initial apicid : 2 2025-05-07T20:23:44.7434341Z fpu : yes 2025-05-07T20:23:44.7434541Z fpu_exception : yes 2025-05-07T20:23:44.7434758Z cpuid level : 13 2025-05-07T20:23:44.7434967Z wp : yes 2025-05-07T20:23:44.7436892Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7439084Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7439571Z bogomips : 5599.99 2025-05-07T20:23:44.7439799Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7440031Z clflush size : 64 
2025-05-07T20:23:44.7440245Z cache_alignment : 64 2025-05-07T20:23:44.7440514Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7440825Z power management: 2025-05-07T20:23:44.7440962Z 2025-05-07T20:23:44.7441049Z processor : 2 2025-05-07T20:23:44.7441267Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7441507Z cpu family : 23 2025-05-07T20:23:44.7441706Z model : 49 2025-05-07T20:23:44.7441918Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7442195Z stepping : 0 2025-05-07T20:23:44.7442424Z microcode : 0x830107f 2025-05-07T20:23:44.7442660Z cpu MHz : 2915.953 2025-05-07T20:23:44.7442885Z cache size : 512 KB 2025-05-07T20:23:44.7443095Z physical id : 0 2025-05-07T20:23:44.7443311Z siblings : 16 2025-05-07T20:23:44.7443519Z core id : 2 2025-05-07T20:23:44.7443720Z cpu cores : 8 2025-05-07T20:23:44.7443927Z apicid : 4 2025-05-07T20:23:44.7444134Z initial apicid : 4 2025-05-07T20:23:44.7444343Z fpu : yes 2025-05-07T20:23:44.7444548Z fpu_exception : yes 2025-05-07T20:23:44.7444778Z cpuid level : 13 2025-05-07T20:23:44.7444979Z wp : yes 2025-05-07T20:23:44.7446981Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7449233Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7449718Z bogomips : 5599.99 2025-05-07T20:23:44.7449941Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7450174Z clflush size : 64 2025-05-07T20:23:44.7450398Z cache_alignment : 64 2025-05-07T20:23:44.7450672Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7450974Z power management: 2025-05-07T20:23:44.7451111Z 2025-05-07T20:23:44.7451197Z processor : 3 2025-05-07T20:23:44.7451437Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7451694Z cpu family : 23 2025-05-07T20:23:44.7451900Z model : 49 2025-05-07T20:23:44.7452110Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7452348Z stepping : 0 2025-05-07T20:23:44.7452554Z microcode : 0x830107f 2025-05-07T20:23:44.7452786Z cpu MHz : 3301.992 2025-05-07T20:23:44.7452994Z cache size : 512 KB 2025-05-07T20:23:44.7453310Z physical id : 0 2025-05-07T20:23:44.7453522Z siblings : 16 2025-05-07T20:23:44.7453716Z core id : 3 2025-05-07T20:23:44.7453914Z cpu cores : 8 2025-05-07T20:23:44.7454113Z apicid : 6 2025-05-07T20:23:44.7454309Z initial apicid : 6 2025-05-07T20:23:44.7454522Z fpu : yes 2025-05-07T20:23:44.7454727Z fpu_exception : yes 2025-05-07T20:23:44.7454948Z cpuid level : 13 2025-05-07T20:23:44.7455157Z wp : yes 2025-05-07T20:23:44.7457088Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7459708Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7460200Z bogomips : 5599.99 2025-05-07T20:23:44.7460416Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7460658Z clflush size : 64 2025-05-07T20:23:44.7460878Z cache_alignment : 64 2025-05-07T20:23:44.7461146Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7461459Z power management: 2025-05-07T20:23:44.7461590Z 2025-05-07T20:23:44.7461696Z processor : 4 2025-05-07T20:23:44.7461907Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7462148Z cpu family : 23 2025-05-07T20:23:44.7462362Z model : 49 2025-05-07T20:23:44.7509484Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7509807Z stepping : 0 2025-05-07T20:23:44.7510022Z microcode : 0x830107f 2025-05-07T20:23:44.7510303Z cpu MHz : 3038.535 2025-05-07T20:23:44.7510554Z cache size : 512 KB 2025-05-07T20:23:44.7510807Z physical id : 0 2025-05-07T20:23:44.7511053Z siblings : 16 2025-05-07T20:23:44.7511324Z core id : 4 2025-05-07T20:23:44.7511528Z cpu cores : 8 2025-05-07T20:23:44.7511732Z apicid : 8 2025-05-07T20:23:44.7511969Z initial apicid : 8 2025-05-07T20:23:44.7512189Z fpu : yes 2025-05-07T20:23:44.7512449Z fpu_exception : yes 2025-05-07T20:23:44.7512668Z cpuid level : 13 2025-05-07T20:23:44.7512875Z wp : yes 2025-05-07T20:23:44.7515095Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7517289Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7517884Z bogomips : 5599.99 2025-05-07T20:23:44.7518105Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7518347Z clflush size : 64 2025-05-07T20:23:44.7518552Z cache_alignment : 64 2025-05-07T20:23:44.7518822Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7519139Z power management: 2025-05-07T20:23:44.7519272Z 2025-05-07T20:23:44.7519353Z processor : 5 2025-05-07T20:23:44.7519570Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7519809Z cpu family : 23 2025-05-07T20:23:44.7520017Z model : 49 2025-05-07T20:23:44.7520223Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7520468Z stepping : 0 2025-05-07T20:23:44.7520668Z microcode : 0x830107f 2025-05-07T20:23:44.7520893Z cpu MHz : 3302.234 2025-05-07T20:23:44.7521100Z cache size : 512 KB 2025-05-07T20:23:44.7521309Z physical id : 0 2025-05-07T20:23:44.7521516Z siblings : 16 2025-05-07T20:23:44.7521713Z core id : 5 2025-05-07T20:23:44.7521945Z cpu cores : 8 2025-05-07T20:23:44.7522160Z apicid : 10 2025-05-07T20:23:44.7522356Z initial apicid : 10 2025-05-07T20:23:44.7522560Z fpu : yes 2025-05-07T20:23:44.7522759Z fpu_exception : yes 2025-05-07T20:23:44.7522977Z cpuid level : 13 2025-05-07T20:23:44.7523177Z wp : yes 2025-05-07T20:23:44.7525088Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7527255Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7527734Z bogomips : 5599.99 2025-05-07T20:23:44.7527948Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7528177Z clflush size : 64 2025-05-07T20:23:44.7528388Z cache_alignment : 64 2025-05-07T20:23:44.7528654Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7528958Z power management: 2025-05-07T20:23:44.7529094Z 2025-05-07T20:23:44.7529179Z processor : 6 2025-05-07T20:23:44.7529392Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7529622Z cpu family : 23 2025-05-07T20:23:44.7529825Z model : 49 2025-05-07T20:23:44.7530035Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7530276Z stepping : 0 2025-05-07T20:23:44.7530479Z microcode : 0x830107f 2025-05-07T20:23:44.7530702Z cpu MHz : 2144.336 2025-05-07T20:23:44.7530907Z cache size : 512 KB 2025-05-07T20:23:44.7531122Z physical id : 0 2025-05-07T20:23:44.7531331Z siblings : 16 2025-05-07T20:23:44.7531523Z core id : 6 2025-05-07T20:23:44.7531722Z cpu cores : 8 2025-05-07T20:23:44.7531919Z apicid : 12 2025-05-07T20:23:44.7532131Z initial apicid : 12 2025-05-07T20:23:44.7532336Z fpu : yes 2025-05-07T20:23:44.7532531Z fpu_exception : yes 2025-05-07T20:23:44.7532755Z cpuid level : 13 2025-05-07T20:23:44.7532957Z wp : yes 2025-05-07T20:23:44.7535003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7537162Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7537641Z bogomips : 5599.99 2025-05-07T20:23:44.7537925Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7538162Z clflush size : 64 2025-05-07T20:23:44.7538376Z cache_alignment : 64 2025-05-07T20:23:44.7538637Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7538949Z power management: 2025-05-07T20:23:44.7539085Z 2025-05-07T20:23:44.7539213Z processor : 7 2025-05-07T20:23:44.7539501Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7539764Z cpu family : 23 2025-05-07T20:23:44.7539966Z model : 49 2025-05-07T20:23:44.7540168Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7540400Z stepping : 0 2025-05-07T20:23:44.7540601Z microcode : 0x830107f 2025-05-07T20:23:44.7540819Z cpu MHz : 2133.022 2025-05-07T20:23:44.7541030Z cache size : 512 KB 2025-05-07T20:23:44.7541240Z physical id : 0 2025-05-07T20:23:44.7541448Z siblings : 16 2025-05-07T20:23:44.7541643Z core id : 7 2025-05-07T20:23:44.7541869Z cpu cores : 8 2025-05-07T20:23:44.7542094Z apicid : 
14 2025-05-07T20:23:44.7542291Z initial apicid : 14 2025-05-07T20:23:44.7542502Z fpu : yes 2025-05-07T20:23:44.7542709Z fpu_exception : yes 2025-05-07T20:23:44.7542909Z cpuid level : 13 2025-05-07T20:23:44.7543111Z wp : yes 2025-05-07T20:23:44.7545012Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7547168Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7547633Z bogomips : 5599.99 2025-05-07T20:23:44.7547856Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7548094Z clflush size : 64 2025-05-07T20:23:44.7548307Z cache_alignment : 64 2025-05-07T20:23:44.7548567Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7548883Z power management: 2025-05-07T20:23:44.7549010Z 2025-05-07T20:23:44.7549094Z processor : 8 2025-05-07T20:23:44.7549301Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7549533Z cpu family : 23 2025-05-07T20:23:44.7549732Z model : 49 2025-05-07T20:23:44.7549930Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7550164Z stepping : 0 2025-05-07T20:23:44.7550369Z microcode : 0x830107f 2025-05-07T20:23:44.7550580Z cpu MHz : 3217.811 2025-05-07T20:23:44.7550779Z cache size : 512 KB 2025-05-07T20:23:44.7550996Z physical id : 0 2025-05-07T20:23:44.7551206Z siblings : 16 2025-05-07T20:23:44.7551414Z core id : 0 2025-05-07T20:23:44.7551616Z cpu cores : 8 2025-05-07T20:23:44.7551805Z apicid : 1 2025-05-07T20:23:44.7552032Z initial apicid : 1 2025-05-07T20:23:44.7552256Z fpu : yes 2025-05-07T20:23:44.7552451Z fpu_exception : yes 2025-05-07T20:23:44.7552667Z cpuid level : 13 2025-05-07T20:23:44.7552866Z wp : yes 2025-05-07T20:23:44.7554760Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7557040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7557508Z bogomips : 5599.99 2025-05-07T20:23:44.7557720Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7557953Z clflush size : 64 2025-05-07T20:23:44.7558165Z cache_alignment : 64 2025-05-07T20:23:44.7558503Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7558807Z power management: 2025-05-07T20:23:44.7558939Z 2025-05-07T20:23:44.7559019Z processor : 9 2025-05-07T20:23:44.7560369Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7560618Z cpu family : 23 2025-05-07T20:23:44.7560820Z model : 49 2025-05-07T20:23:44.7561018Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7561255Z 
stepping : 0 2025-05-07T20:23:44.7561458Z microcode : 0x830107f 2025-05-07T20:23:44.7561676Z cpu MHz : 2209.047 2025-05-07T20:23:44.7561890Z cache size : 512 KB 2025-05-07T20:23:44.7562102Z physical id : 0 2025-05-07T20:23:44.7562300Z siblings : 16 2025-05-07T20:23:44.7562491Z core id : 1 2025-05-07T20:23:44.7562691Z cpu cores : 8 2025-05-07T20:23:44.7562886Z apicid : 3 2025-05-07T20:23:44.7563073Z initial apicid : 3 2025-05-07T20:23:44.7563282Z fpu : yes 2025-05-07T20:23:44.7563477Z fpu_exception : yes 2025-05-07T20:23:44.7563684Z cpuid level : 13 2025-05-07T20:23:44.7563886Z wp : yes 2025-05-07T20:23:44.7565784Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7567949Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7568415Z bogomips : 5599.99 2025-05-07T20:23:44.7568633Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7568862Z clflush size : 64 2025-05-07T20:23:44.7569067Z cache_alignment : 64 2025-05-07T20:23:44.7569325Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7569636Z power management: 2025-05-07T20:23:44.7569762Z 2025-05-07T20:23:44.7569848Z processor : 10 2025-05-07T20:23:44.7570056Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7570291Z cpu family : 23 2025-05-07T20:23:44.7570492Z model : 49 2025-05-07T20:23:44.7570689Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7570931Z stepping : 0 2025-05-07T20:23:44.7571130Z microcode : 0x830107f 2025-05-07T20:23:44.7571343Z cpu MHz : 2969.520 2025-05-07T20:23:44.7571553Z cache size : 512 KB 2025-05-07T20:23:44.7571783Z physical id : 0 2025-05-07T20:23:44.7572007Z siblings : 16 2025-05-07T20:23:44.7572201Z core id : 2 2025-05-07T20:23:44.7572394Z cpu cores : 8 2025-05-07T20:23:44.7572582Z apicid : 5 2025-05-07T20:23:44.7572789Z initial apicid : 5 2025-05-07T20:23:44.7572994Z fpu : yes 2025-05-07T20:23:44.7573245Z fpu_exception : yes 2025-05-07T20:23:44.7573454Z cpuid level : 13 2025-05-07T20:23:44.7573653Z wp : yes 2025-05-07T20:23:44.7575544Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7577703Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7578173Z bogomips : 5599.99 2025-05-07T20:23:44.7578570Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7578810Z clflush size : 64 2025-05-07T20:23:44.7579015Z cache_alignment : 64 2025-05-07T20:23:44.7579278Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:44.7579585Z power management: 2025-05-07T20:23:44.7579710Z 2025-05-07T20:23:44.7579987Z processor : 11 2025-05-07T20:23:44.7580202Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7580433Z cpu family : 23 2025-05-07T20:23:44.7580639Z model : 49 2025-05-07T20:23:44.7580841Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7581072Z stepping : 0 2025-05-07T20:23:44.7581279Z microcode : 0x830107f 2025-05-07T20:23:44.7581501Z cpu MHz : 3299.173 2025-05-07T20:23:44.7581702Z cache size : 512 KB 2025-05-07T20:23:44.7581913Z physical id : 0 2025-05-07T20:23:44.7582152Z siblings : 16 2025-05-07T20:23:44.7582361Z core id : 3 2025-05-07T20:23:44.7582554Z cpu cores : 8 2025-05-07T20:23:44.7582752Z apicid : 7 2025-05-07T20:23:44.7582939Z initial apicid : 7 2025-05-07T20:23:44.7583153Z fpu : yes 2025-05-07T20:23:44.7583345Z fpu_exception : yes 2025-05-07T20:23:44.7583555Z cpuid level : 13 2025-05-07T20:23:44.7583754Z wp : yes 2025-05-07T20:23:44.7585647Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7587805Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7588274Z bogomips : 5599.99 2025-05-07T20:23:44.7588479Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7588718Z clflush size : 64 2025-05-07T20:23:44.7588926Z cache_alignment : 64 2025-05-07T20:23:44.7589183Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7589491Z power management: 2025-05-07T20:23:44.7589617Z 2025-05-07T20:23:44.7589705Z processor : 12 2025-05-07T20:23:44.7589910Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7590148Z cpu family : 23 2025-05-07T20:23:44.7590346Z model : 49 2025-05-07T20:23:44.7590546Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7590775Z stepping : 0 2025-05-07T20:23:44.7590980Z microcode : 0x830107f 2025-05-07T20:23:44.7591205Z cpu MHz : 3138.651 2025-05-07T20:23:44.7591408Z cache size : 512 KB 2025-05-07T20:23:44.7591612Z physical id : 0 2025-05-07T20:23:44.7591817Z siblings : 16 2025-05-07T20:23:44.7592009Z core id : 4 2025-05-07T20:23:44.7592199Z cpu cores : 8 2025-05-07T20:23:44.7592390Z apicid : 9 2025-05-07T20:23:44.7592574Z initial apicid : 9 2025-05-07T20:23:44.7592776Z fpu : yes 2025-05-07T20:23:44.7592973Z fpu_exception : yes 2025-05-07T20:23:44.7593187Z cpuid level : 13 2025-05-07T20:23:44.7593392Z wp : yes 2025-05-07T20:23:44.7595292Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:44.7597445Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7597907Z bogomips : 5599.99 2025-05-07T20:23:44.7598118Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7598349Z clflush size : 64 2025-05-07T20:23:44.7598560Z cache_alignment : 64 2025-05-07T20:23:44.7598918Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7599227Z power management: 2025-05-07T20:23:44.7599353Z 2025-05-07T20:23:44.7599438Z processor : 13 2025-05-07T20:23:44.7599640Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7599868Z cpu family : 23 2025-05-07T20:23:44.7600152Z model : 49 2025-05-07T20:23:44.7600345Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7600582Z stepping : 0 2025-05-07T20:23:44.7600784Z microcode : 0x830107f 2025-05-07T20:23:44.7600996Z cpu MHz : 3259.052 2025-05-07T20:23:44.7601204Z cache size : 512 KB 2025-05-07T20:23:44.7601415Z physical id : 0 2025-05-07T20:23:44.7601612Z siblings : 16 2025-05-07T20:23:44.7601803Z core id : 5 2025-05-07T20:23:44.7602003Z cpu cores : 8 2025-05-07T20:23:44.7602194Z apicid : 11 2025-05-07T20:23:44.7602397Z initial apicid : 11 2025-05-07T20:23:44.7602605Z fpu : yes 2025-05-07T20:23:44.7602796Z fpu_exception : yes 2025-05-07T20:23:44.7603008Z cpuid level : 13 2025-05-07T20:23:44.7603211Z wp : yes 2025-05-07T20:23:44.7605125Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7607283Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7607748Z bogomips : 5599.99 2025-05-07T20:23:44.7607963Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7608192Z clflush size : 64 2025-05-07T20:23:44.7608398Z cache_alignment : 64 2025-05-07T20:23:44.7608663Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7608982Z power management: 2025-05-07T20:23:44.7609110Z 2025-05-07T20:23:44.7609190Z processor : 14 2025-05-07T20:23:44.7609399Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7609633Z cpu family : 23 2025-05-07T20:23:44.7609829Z model : 49 2025-05-07T20:23:44.7610030Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7610266Z stepping : 0 2025-05-07T20:23:44.7610462Z microcode : 0x830107f 2025-05-07T20:23:44.7610684Z cpu MHz : 2231.959 2025-05-07T20:23:44.7610890Z cache size : 512 KB 2025-05-07T20:23:44.7611095Z physical id : 0 2025-05-07T20:23:44.7611301Z siblings : 16 2025-05-07T20:23:44.7611499Z core id : 6 2025-05-07T20:23:44.7611690Z cpu cores : 8 2025-05-07T20:23:44.7611914Z apicid : 13 2025-05-07T20:23:44.7612137Z initial apicid : 13 2025-05-07T20:23:44.7612345Z fpu : yes 2025-05-07T20:23:44.7612537Z fpu_exception : yes 2025-05-07T20:23:44.7612748Z cpuid level : 13 2025-05-07T20:23:44.7612946Z wp : yes 2025-05-07T20:23:44.7614883Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7617038Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7617506Z bogomips : 5599.99 2025-05-07T20:23:44.7617723Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7617945Z clflush size : 64 2025-05-07T20:23:44.7618157Z cache_alignment : 64 2025-05-07T20:23:44.7618429Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7618728Z power management: 2025-05-07T20:23:44.7618860Z 2025-05-07T20:23:44.7619037Z processor : 15 2025-05-07T20:23:44.7619246Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7619474Z cpu family : 23 2025-05-07T20:23:44.7619671Z model : 49 2025-05-07T20:23:44.7619872Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7620109Z stepping : 0 2025-05-07T20:23:44.7620306Z microcode : 0x830107f 2025-05-07T20:23:44.7620606Z cpu MHz : 3056.931 2025-05-07T20:23:44.7620812Z cache size : 512 KB 2025-05-07T20:23:44.7621016Z physical id : 0 2025-05-07T20:23:44.7621215Z siblings : 16 2025-05-07T20:23:44.7621414Z core id : 7 2025-05-07T20:23:44.7621600Z cpu cores : 8 2025-05-07T20:23:44.7621793Z apicid : 15 2025-05-07T20:23:44.7622023Z initial apicid : 15 2025-05-07T20:23:44.7622247Z fpu : yes 2025-05-07T20:23:44.7622441Z fpu_exception : yes 2025-05-07T20:23:44.7622650Z cpuid level : 13 2025-05-07T20:23:44.7622845Z wp : yes 2025-05-07T20:23:44.7624745Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7626908Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7627372Z bogomips : 5599.99 2025-05-07T20:23:44.7627580Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7627810Z clflush size : 64 2025-05-07T20:23:44.7628021Z cache_alignment : 64 2025-05-07T20:23:44.7628286Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7628589Z power management: 2025-05-07T20:23:44.7628724Z 2025-05-07T20:23:44.7628728Z 2025-05-07T20:23:44.7628854Z ################################################################################ 2025-05-07T20:23:44.7629157Z [INFO] Print PCI info ... 2025-05-07T20:23:44.7629391Z + lspci -v 2025-05-07T20:23:44.7629506Z 2025-05-07T20:23:44.7629733Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:44.7630113Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:44.7630423Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:44.7630624Z 2025-05-07T20:23:44.7630815Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:44.7631183Z Physical Slot: 1 2025-05-07T20:23:44.7631420Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7631617Z 2025-05-07T20:23:44.7631858Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:44.7632273Z Physical Slot: 1 2025-05-07T20:23:44.7632522Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:44.7632742Z 2025-05-07T20:23:44.7633005Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:44.7633437Z Physical Slot: 3 2025-05-07T20:23:44.7633670Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7634002Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:44.7634351Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:44.7634565Z 2025-05-07T20:23:44.7634858Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:44.7635356Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:44.7635635Z Physical Slot: 4 2025-05-07T20:23:44.7635882Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:44.7636251Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7636600Z Capabilities: 2025-05-07T20:23:44.7636878Z Kernel driver in use: nvme 2025-05-07T20:23:44.7637040Z 2025-05-07T20:23:44.7637343Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:44.7637813Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:44.7638157Z Physical Slot: 5 2025-05-07T20:23:44.7638387Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7638740Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7639216Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:44.7639529Z Capabilities: 2025-05-07T20:23:44.7639788Z Kernel driver in use: ena 2025-05-07T20:23:44.7640025Z Kernel modules: ena 2025-05-07T20:23:44.7640161Z 2025-05-07T20:23:44.7640327Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:44.7640690Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:44.7640972Z Physical Slot: 30 2025-05-07T20:23:44.7641223Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:44.7641585Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:44.7642005Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:44.7642383Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:44.7642702Z Capabilities: 2025-05-07T20:23:44.7642964Z Kernel driver in use: nvidia 2025-05-07T20:23:44.7643208Z Kernel modules: nvidia 2025-05-07T20:23:44.7643357Z 2025-05-07T20:23:44.7643653Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:44.7644144Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:44.7644434Z Physical Slot: 31 2025-05-07T20:23:44.7644672Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7645019Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7645392Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:44.7645713Z Capabilities: 2025-05-07T20:23:44.7645965Z Kernel driver in use: nvme 2025-05-07T20:23:44.7646122Z 2025-05-07T20:23:44.7646126Z 2025-05-07T20:23:44.7646245Z ################################################################################ 2025-05-07T20:23:44.7646563Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:44.7646841Z + uname -a 2025-05-07T20:23:44.7646951Z 2025-05-07T20:23:44.7647345Z Linux ip-10-0-27-143.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:44.7647833Z 2025-05-07T20:23:44.7647915Z + uname -m 2025-05-07T20:23:44.7648028Z 2025-05-07T20:23:44.7648107Z x86_64 2025-05-07T20:23:44.7654417Z 2025-05-07T20:23:44.7654520Z + cat /proc/version 2025-05-07T20:23:44.7654670Z 2025-05-07T20:23:44.7655204Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:44.7655821Z 2025-05-07T20:23:44.7655909Z + cat /etc/os-release 2025-05-07T20:23:44.7656051Z 2025-05-07T20:23:44.7656149Z NAME="Amazon Linux" 2025-05-07T20:23:44.7656351Z VERSION="2023" 2025-05-07T20:23:44.7656549Z ID="amzn" 2025-05-07T20:23:44.7656734Z ID_LIKE="fedora" 2025-05-07T20:23:44.7656931Z VERSION_ID="2023" 2025-05-07T20:23:44.7657161Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:44.7657436Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:44.7657726Z ANSI_COLOR="0;33" 2025-05-07T20:23:44.7657962Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:44.7658348Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:44.7658774Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:44.7659174Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:44.7659932Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:44.7660305Z VENDOR_NAME="AWS" 2025-05-07T20:23:44.7660539Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:44.7660817Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:44.7660967Z 2025-05-07T20:23:44.7661269Z ################################################################################ 2025-05-07T20:23:44.7661573Z # Print EC2 Instance Info 2025-05-07T20:23:44.7661803Z # 2025-05-07T20:23:44.7662006Z # [2025-05-07T20:23:44.761Z] + print_ec2_info 2025-05-07T20:23:44.7662308Z ################################################################################ 2025-05-07T20:23:44.7662622Z 2025-05-07T20:23:44.7736146Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:44.7848301Z instance-id: i-050728826a2d12e7e 2025-05-07T20:23:44.7963641Z instance-type: g5.4xlarge 2025-05-07T20:23:44.8004872Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:44.8005230Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:44.8015189Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:44.8015547Z env: 2025-05-07T20:23:44.8015773Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:44.8016084Z BUILD_ENV: build_binary 2025-05-07T20:23:44.8016339Z BUILD_TARGET: genai 2025-05-07T20:23:44.8016571Z BUILD_VARIANT: cuda 2025-05-07T20:23:44.8016813Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:44.8017280Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:44.8017669Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:44.8018001Z ##[endgroup] 2025-05-07T20:23:45.1382029Z ################################################################################ 2025-05-07T20:23:45.1382482Z [INFO] Printing general display info ... 2025-05-07T20:23:45.1396928Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:45.2614753Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:45.2623072Z /usr/bin/sudo 2025-05-07T20:23:45.2634488Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:45.2644762Z /usr/bin/yum 2025-05-07T20:23:45.2645766Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:45.2667473Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:45.7472499Z Last metadata expiration check: 0:00:05 ago on Wed May 7 20:23:40 2025. 2025-05-07T20:23:45.8151820Z ================================================================================ 2025-05-07T20:23:45.8152533Z WARNING: 2025-05-07T20:23:45.8153183Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:45.8153821Z 2025-05-07T20:23:45.8154102Z Available Versions: 2025-05-07T20:23:45.8154541Z 2025-05-07T20:23:45.8154752Z Version 2023.7.20250331: 2025-05-07T20:23:45.8155360Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:45.8155849Z 2025-05-07T20:23:45.8156099Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:45.8156520Z 2025-05-07T20:23:45.8156689Z Release notes: 2025-05-07T20:23:45.8157480Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:45.8158197Z 2025-05-07T20:23:45.8158382Z Version 2023.7.20250414: 2025-05-07T20:23:45.8158970Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:45.8159904Z 2025-05-07T20:23:45.8160134Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:45.8160547Z 2025-05-07T20:23:45.8160724Z Release notes: 2025-05-07T20:23:45.8161480Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:45.8161990Z 2025-05-07T20:23:45.8162098Z Version 2023.7.20250428: 2025-05-07T20:23:45.8162416Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:45.8162658Z 2025-05-07T20:23:45.8162778Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:45.8162985Z 2025-05-07T20:23:45.8163069Z Release notes: 2025-05-07T20:23:45.8163452Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:45.8163810Z 2025-05-07T20:23:45.8163924Z ================================================================================ 2025-05-07T20:23:45.9300753Z Dependencies resolved. 
2025-05-07T20:23:45.9586922Z ================================================================================ 2025-05-07T20:23:45.9587518Z Package Arch Version Repository Size 2025-05-07T20:23:45.9588022Z ================================================================================ 2025-05-07T20:23:45.9588435Z Upgrading: 2025-05-07T20:23:45.9589015Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:45.9589598Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:45.9589978Z 2025-05-07T20:23:45.9590348Z Transaction Summary 2025-05-07T20:23:45.9590701Z ================================================================================ 2025-05-07T20:23:45.9591114Z Upgrade 2 Packages 2025-05-07T20:23:45.9591277Z 2025-05-07T20:23:45.9591416Z Total download size: 6.9 M 2025-05-07T20:23:45.9591770Z Downloading Packages: 2025-05-07T20:23:46.0240896Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 19 MB/s | 1.2 MB 00:00 2025-05-07T20:23:46.0497077Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 63 MB/s | 5.7 MB 00:00 2025-05-07T20:23:46.0505963Z -------------------------------------------------------------------------------- 2025-05-07T20:23:46.0509178Z Total 76 MB/s | 6.9 MB 00:00 2025-05-07T20:23:46.0511759Z Running transaction check 2025-05-07T20:23:46.0611352Z Transaction check succeeded. 2025-05-07T20:23:46.0612606Z Running transaction test 2025-05-07T20:23:46.0908231Z Transaction test succeeded. 2025-05-07T20:23:46.0910981Z Running transaction 2025-05-07T20:23:46.6435460Z Preparing : 1/1 2025-05-07T20:23:46.7495370Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:46.7516598Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:46.7740714Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:46.7741466Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:46.7851068Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:46.7876992Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:46.9553518Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:46.9554322Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:46.9554965Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:46.9555496Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:47.1533094Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:47.1533658Z 2025-05-07T20:23:47.1533782Z Upgraded: 2025-05-07T20:23:47.1534179Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:47.1534747Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:47.1535082Z 2025-05-07T20:23:47.1535178Z Complete! 2025-05-07T20:23:47.1983858Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:47.2007898Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:47.6572650Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:40 2025. 2025-05-07T20:23:47.6811822Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:47.7216035Z Dependencies resolved.
2025-05-07T20:23:47.7395824Z ================================================================================ 2025-05-07T20:23:47.7396932Z Package Architecture Version Repository Size 2025-05-07T20:23:47.7397762Z ================================================================================ 2025-05-07T20:23:47.7398552Z Installing: 2025-05-07T20:23:47.7399165Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:47.7399819Z 2025-05-07T20:23:47.7400004Z Transaction Summary 2025-05-07T20:23:47.7400504Z ================================================================================ 2025-05-07T20:23:47.7401087Z Install 1 Package 2025-05-07T20:23:47.7401362Z 2025-05-07T20:23:47.7401560Z Total download size: 319 k 2025-05-07T20:23:47.7402068Z Installed size: 837 k 2025-05-07T20:23:47.7402432Z Downloading Packages: 2025-05-07T20:23:47.8124013Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.1 MB/s | 319 kB 00:00 2025-05-07T20:23:47.8129431Z -------------------------------------------------------------------------------- 2025-05-07T20:23:47.8132153Z Total 4.3 MB/s | 319 kB 00:00 2025-05-07T20:23:47.8288458Z Running transaction check 2025-05-07T20:23:47.8342552Z Transaction check succeeded. 2025-05-07T20:23:47.8342851Z Running transaction test 2025-05-07T20:23:47.8803965Z Transaction test succeeded. 2025-05-07T20:23:47.8807695Z Running transaction 2025-05-07T20:23:47.9827969Z Preparing : 1/1 2025-05-07T20:23:48.0323582Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:48.1963829Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:48.3622197Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:48.3622531Z 2025-05-07T20:23:48.3622629Z Installed: 2025-05-07T20:23:48.3622944Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:48.3623259Z 2025-05-07T20:23:48.3623345Z Complete! 2025-05-07T20:23:48.4070239Z + hostname 2025-05-07T20:23:48.4070396Z 2025-05-07T20:23:48.4084312Z ip-10-0-27-143.ec2.internal 2025-05-07T20:23:48.4085791Z 2025-05-07T20:23:48.4085997Z + sudo lshw -C display 2025-05-07T20:23:48.4086162Z 2025-05-07T20:23:48.8891098Z *-display:0 UNCLAIMED 2025-05-07T20:23:48.8891792Z description: VGA compatible controller 2025-05-07T20:23:48.8892449Z product: Amazon.com, Inc. 2025-05-07T20:23:48.8892914Z vendor: Amazon.com, Inc.
2025-05-07T20:23:48.8893276Z physical id: 3 2025-05-07T20:23:48.8893521Z bus info: pci@0000:00:03.0 2025-05-07T20:23:48.8893787Z version: 00 2025-05-07T20:23:48.8894001Z width: 32 bits 2025-05-07T20:23:48.8894235Z clock: 33MHz 2025-05-07T20:23:48.8894502Z capabilities: vga_controller bus_master 2025-05-07T20:23:48.8894827Z configuration: latency=0 2025-05-07T20:23:48.8895182Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:48.8895525Z *-display:1 2025-05-07T20:23:48.8895758Z description: 3D controller 2025-05-07T20:23:48.8896053Z product: GA102GL [A10G] 2025-05-07T20:23:48.8896326Z vendor: NVIDIA Corporation 2025-05-07T20:23:48.8896604Z physical id: 1e 2025-05-07T20:23:48.8896843Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:48.8897107Z version: a1 2025-05-07T20:23:48.8897331Z width: 64 bits 2025-05-07T20:23:48.8897549Z clock: 33MHz 2025-05-07T20:23:48.8897850Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:48.8898227Z configuration: driver=nvidia latency=0 2025-05-07T20:23:48.8898834Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:48.8929063Z 2025-05-07T20:23:48.8929565Z ################################################################################ 2025-05-07T20:23:48.8929908Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:48.9058015Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:48.9243693Z Wed May 7 20:23:48 2025 2025-05-07T20:23:48.9244065Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:48.9244566Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:48.9245048Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:48.9245533Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:48.9246038Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:48.9246458Z | | | MIG M. | 2025-05-07T20:23:48.9246787Z |=========================================+========================+======================| 2025-05-07T20:23:48.9379335Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:48.9379966Z | 0% 26C P8 9W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:48.9380336Z | | | N/A | 2025-05-07T20:23:48.9380723Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:48.9384176Z 2025-05-07T20:23:48.9384575Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:48.9384992Z | Processes: | 2025-05-07T20:23:48.9385420Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:48.9385822Z | ID ID Usage | 2025-05-07T20:23:48.9386169Z |=========================================================================================| 2025-05-07T20:23:48.9389464Z | No running processes found | 2025-05-07T20:23:48.9390075Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:49.1815908Z ################################################################################ 2025-05-07T20:23:49.1816280Z [INFO] Printing AMD GPU info ... 
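[NOTE] The AMD GPU probe below first checks whether the ROCm CLI tools are on the PATH and only reports "[CHECK] ... not found" when they are absent. A minimal sketch of that probe, under the assumption that it is a simple which-based check (the function name is illustrative, not taken from this log):

# Hypothetical ROCm probe matching the "[CHECK] rocminfo not found" output below.
print_rocm_info () {
  local tool
  for tool in rocminfo rocm-smi; do
    if which "$tool"; then
      "$tool"                          # tool present: dump its report
    else
      echo "[CHECK] $tool not found"   # tool absent: note it and continue
    fi
  done
}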
2025-05-07T20:23:49.1815908Z ################################################################################
2025-05-07T20:23:49.1816280Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:49.1959413Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:49.1960210Z [CHECK] rocminfo not found
2025-05-07T20:23:49.1969760Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:49.1971209Z [CHECK] rocm-smi not found
2025-05-07T20:23:49.2014748Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:49.2015182Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:49.2027443Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:49.2027790Z env:
2025-05-07T20:23:49.2028009Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:49.2028308Z   BUILD_ENV: build_binary
2025-05-07T20:23:49.2028556Z   BUILD_TARGET: genai
2025-05-07T20:23:49.2028775Z   BUILD_VARIANT: cuda
2025-05-07T20:23:49.2029012Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:49.2029274Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:49.2029574Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:49.2029905Z ##[endgroup]
2025-05-07T20:23:49.5368837Z ################################################################################
2025-05-07T20:23:49.5369213Z # Setup Miniconda
2025-05-07T20:23:49.5369421Z #
2025-05-07T20:23:49.5383924Z # [2025-05-07T20:23:49.538Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:49.5384340Z ################################################################################
2025-05-07T20:23:49.5398802Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:49.6334969Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:49.6335340Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:49.6353976Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:49.6379889Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:50.6243722Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:50.6244110Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:50.6390059Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:51.0898913Z Unpacking payload ...
2025-05-07T20:23:51.6080602Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:52.4141639Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:54.5148615Z Installing base environment...
2025-05-07T20:23:55.6012313Z Preparing transaction: ...working... done
2025-05-07T20:23:58.5303968Z Executing transaction: ...working... done
2025-05-07T20:23:59.1918828Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:59.2792517Z installation finished.
2025-05-07T20:23:59.2801608Z + rm -f miniconda.sh
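The DeprecationWarning repeated during unpacking comes from Python's tarfile module: from Python 3.14 onward, extraction applies a filter by default. Code that extracts archives can opt in explicitly today (Python 3.12+); a minimal sketch, with purely illustrative path names:

    python3 - <<'EOF'
    import tarfile
    # The 'data' filter (PEP 706) strips setuid/setgid bits and rejects absolute
    # paths or links that escape the destination directory.
    with tarfile.open("payload.tar.gz") as tf:        # hypothetical archive
        tf.extractall(path="dest", filter="data")     # silences the warning
    EOF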
2025-05-07T20:23:59.3116592Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:59.3116962Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:59.6792088Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:59.6792638Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:59.6793107Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:59.6793575Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:59.6793994Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:59.6794379Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:59.6794805Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:59.6795240Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:59.6795690Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:59.6796476Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:59.6797009Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:59.6797372Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:59.6797753Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:59.7435047Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:00.5768085Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:00.5792476Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:13.9231589Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:15.4947767Z Solving environment: done
2025-05-07T20:24:15.5914739Z ## Package Plan ##
2025-05-07T20:24:15.5915515Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:15.5916327Z   added / updated specs:
2025-05-07T20:24:15.5916929Z     - conda-libmamba-solver
2025-05-07T20:24:15.5917438Z     - libarchive
2025-05-07T20:24:15.5917814Z     - libmamba
2025-05-07T20:24:15.5918173Z     - libmambapy
2025-05-07T20:24:15.5918655Z The following packages will be downloaded:
2025-05-07T20:24:15.5919633Z     package                     |            build
2025-05-07T20:24:15.5920191Z     ----------------------------|-----------------
2025-05-07T20:24:15.5920923Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:24:15.5921767Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:24:15.5922533Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:24:15.5923366Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:24:15.5924127Z     ------------------------------------------------------------
2025-05-07T20:24:15.5924477Z                                            Total:         1.4 MB
2025-05-07T20:24:15.5924797Z The following packages will be UPDATED:
2025-05-07T20:24:15.5928662Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:15.5929454Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:15.5930049Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:15.5930736Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:15.5931508Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:15.5932133Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:15.6506016Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:24:15.6733593Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:24:15.6768348Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:24:15.6906859Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:24:15.7925779Z done
2025-05-07T20:24:15.8928341Z Preparing transaction: done
2025-05-07T20:24:15.9934503Z Verifying transaction: done
2025-05-07T20:24:17.2954321Z Executing transaction: done
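The packages installed above provide conda's libmamba solver; on conda releases where it is not already the default, it can be enabled with a single config setting. A sketch (the conda info output further below confirms this installation already defaults to it):

    # Make libmamba the default dependency solver for all conda operations
    conda config --set solver libmamba
    # Or opt in for a single command:
    conda install --solver=libmamba -y numpy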
2025-05-07T20:24:19.0078145Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:19.0103103Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:19.9478958Z Channels:
2025-05-07T20:24:19.9479205Z  - defaults
2025-05-07T20:24:19.9479427Z Platform: linux-64
2025-05-07T20:24:21.1477784Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.2691233Z Solving environment: done
2025-05-07T20:24:21.2691545Z Channels:
2025-05-07T20:24:21.2691769Z  - defaults
2025-05-07T20:24:21.2691769Z Platform: linux-64
2025-05-07T20:24:21.5605381Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.7711340Z Solving environment: done
2025-05-07T20:24:21.9237149Z ## Package Plan ##
2025-05-07T20:24:21.9237450Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:21.9237779Z   added / updated specs:
2025-05-07T20:24:21.9238021Z     - conda
2025-05-07T20:24:21.9238276Z The following packages will be downloaded:
2025-05-07T20:24:21.9238596Z     package                    |            build
2025-05-07T20:24:21.9238914Z     ---------------------------|-----------------
2025-05-07T20:24:21.9239255Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:24:21.9239631Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:24:21.9239990Z     ------------------------------------------------------------
2025-05-07T20:24:21.9240324Z                                            Total:         1.4 MB
2025-05-07T20:24:21.9240860Z The following packages will be UPDATED:
2025-05-07T20:24:21.9241376Z   pip       pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:21.9241870Z   tzdata                      2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:21.9242266Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.0556327Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:24:22.1094517Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:24:22.2126045Z done
2025-05-07T20:24:22.3128692Z Preparing transaction: done
2025-05-07T20:24:22.4134343Z Verifying transaction: done
2025-05-07T20:24:24.4164362Z Executing transaction: done
2025-05-07T20:24:25.0339600Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:25.0344348Z + conda clean --packages --tarball -y
2025-05-07T20:24:26.0432811Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:26.0433366Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:26.1096543Z + conda clean --all -y
2025-05-07T20:24:26.6507852Z There are no unused tarball(s) to remove.
2025-05-07T20:24:26.6508208Z Will remove 1 index cache(s).
2025-05-07T20:24:26.6508491Z There are no unused package(s) to remove.
2025-05-07T20:24:26.6508801Z There are no tempfile(s) to remove.
2025-05-07T20:24:26.6509092Z There are no logfile(s) to remove.
2025-05-07T20:24:26.7169889Z + conda info
2025-05-07T20:24:27.4908947Z      active environment : base
2025-05-07T20:24:27.4909402Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:24:27.4909721Z             shell level : 1
2025-05-07T20:24:27.4909997Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:24:27.4910378Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:27.4910772Z           conda version : 25.3.1
2025-05-07T20:24:27.4911050Z     conda-build version : not installed
2025-05-07T20:24:27.4911346Z          python version : 3.13.2.final.0
2025-05-07T20:24:27.4911638Z                  solver : libmamba (default)
2025-05-07T20:24:27.4911941Z        virtual packages : __archspec=1=zen2
2025-05-07T20:24:27.4912234Z                           __conda=25.3.1=0
2025-05-07T20:24:27.4912507Z                           __cuda=12.8=0
2025-05-07T20:24:27.4912774Z                           __glibc=2.34=0
2025-05-07T20:24:27.4913048Z                           __linux=6.1.130=0
2025-05-07T20:24:27.4913325Z                           __unix=0=0
2025-05-07T20:24:27.4913651Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:24:27.4914052Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:27.4914398Z   conda av metadata url : None
2025-05-07T20:24:27.4915070Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:27.4915509Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:27.4915890Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:27.4916259Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:27.4916620Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:27.4916958Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:24:27.4917294Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:27.4917620Z                           /home/ec2-user/.conda/envs
2025-05-07T20:24:27.4917952Z                platform : linux-64
2025-05-07T20:24:27.4918791Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:27.4919607Z                 UID:GID : 1000:1000
2025-05-07T20:24:27.4919879Z              netrc file : None
2025-05-07T20:24:27.4920138Z            offline mode : False
2025-05-07T20:24:27.5569790Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:27.5570825Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9383b14d-7f66-434c-93e5-e2304d3bdbb6 ...
2025-05-07T20:24:27.5571911Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
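Condensed, the bootstrap that setup_miniconda performed above reduces to four commands; a minimal sketch, assuming the same prefix as this runner:

    # Non-interactive Miniconda install: -b batch mode, -p prefix, -u update in place
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "${HOME}/miniconda" -u
    rm -f miniconda.sh
    "${HOME}/miniconda/bin/conda" init bash && source "${HOME}/.bashrc"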
2025-05-07T20:24:27.5691697Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:27.5692190Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:27.5710436Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:27.5710796Z env:
2025-05-07T20:24:27.5711023Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:27.5711320Z   BUILD_ENV: build_binary
2025-05-07T20:24:27.5711571Z   BUILD_TARGET: genai
2025-05-07T20:24:27.5711795Z   BUILD_VARIANT: cuda
2025-05-07T20:24:27.5712204Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:27.5712460Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:27.5712762Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:27.5713093Z ##[endgroup]
2025-05-07T20:24:27.9091783Z ################################################################################
2025-05-07T20:24:27.9092297Z # Create Conda Environment
2025-05-07T20:24:27.9092575Z #
2025-05-07T20:24:27.9107308Z # [2025-05-07T20:24:27.910Z] + create_conda_environment build_binary 3.12
2025-05-07T20:24:27.9107876Z ################################################################################
2025-05-07T20:24:27.9122926Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:28.0040539Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:28.0041044Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:28.0041470Z + conda info --envs
2025-05-07T20:24:28.7602089Z # conda environments:
2025-05-07T20:24:28.7602371Z #
2025-05-07T20:24:28.7602605Z base                  /home/ec2-user/miniconda
2025-05-07T20:24:28.8257971Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:30.4592128Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.4617115Z [SETUP] Creating new Conda environment (Python 3.12) ...
2025-05-07T20:24:30.4638697Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:31.2239724Z Channels:
2025-05-07T20:24:31.2240050Z  - defaults
2025-05-07T20:24:31.2240333Z Platform: linux-64
2025-05-07T20:24:32.7499577Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:32.8505047Z Solving environment: done
2025-05-07T20:24:32.8792485Z ## Package Plan ##
2025-05-07T20:24:32.8793033Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:32.8793570Z   added / updated specs:
2025-05-07T20:24:32.8793896Z     - python=3.12
2025-05-07T20:24:32.8794162Z The following packages will be downloaded:
2025-05-07T20:24:32.8794504Z     package                    |            build
2025-05-07T20:24:32.8794826Z     ---------------------------|-----------------
2025-05-07T20:24:32.8795188Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:32.8795583Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:32.8795993Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:32.8796500Z     python-3.12.9              |       h5148396_0        34.7 MB
2025-05-07T20:24:32.8796965Z     setuptools-78.1.1          |  py312h06a4308_0         2.2 MB
2025-05-07T20:24:32.8797356Z     wheel-0.45.1               |  py312h06a4308_0         147 KB
2025-05-07T20:24:32.8797720Z     ------------------------------------------------------------
2025-05-07T20:24:32.8798075Z                                            Total:        37.2 MB
2025-05-07T20:24:32.8798413Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:32.8799199Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:32.8799654Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:32.8800073Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:32.8800601Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:32.8801081Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:32.8801528Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:32.8802130Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:32.8802553Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:32.8802979Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:32.8803433Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:32.8804032Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:32.8804603Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:32.8805047Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:32.8805461Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:32.8805868Z   python             pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:32.8806292Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:32.8806767Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:32.8807245Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:32.8807635Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:32.8808019Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:32.8808441Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:32.8808845Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:32.8809221Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
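The same environment could also be captured declaratively rather than via the imperative conda create shown above; a sketch using a hypothetical environment.yml of our own naming:

    # Write a spec file equivalent to 'conda create -n build_binary python=3.12'
    cat > environment.yml <<'EOF'
    name: build_binary
    channels:
      - defaults
    dependencies:
      - python=3.12
    EOF
    conda env create -f environment.yml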
2025-05-07T20:24:32.8809619Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:32.9306915Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:32.9513844Z wheel-0.45.1         | 147 KB    | ########## | 100%
2025-05-07T20:24:32.9797968Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:32.9845600Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:33.0800965Z setuptools-78.1.1    | 2.2 MB    | ########## | 100%
2025-05-07T20:24:33.3679983Z python-3.12.9        | 34.7 MB   | ########## | 100%
2025-05-07T20:24:34.0224847Z done
2025-05-07T20:24:34.2331618Z Preparing transaction: done
2025-05-07T20:24:35.6495024Z Verifying transaction: done
2025-05-07T20:24:37.9723929Z Executing transaction: done
2025-05-07T20:24:38.0230226Z #
2025-05-07T20:24:38.0230502Z # To activate this environment, use
2025-05-07T20:24:38.0231025Z #
2025-05-07T20:24:38.0231266Z #     $ conda activate build_binary
2025-05-07T20:24:38.0231538Z #
2025-05-07T20:24:38.0231760Z # To deactivate an active environment, use
2025-05-07T20:24:38.0232075Z #
2025-05-07T20:24:38.0232287Z #     $ conda deactivate
2025-05-07T20:24:38.1311278Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:38.1333322Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:41.1392035Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:41.1392935Z Collecting pip
2025-05-07T20:24:41.1393404Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:41.1394010Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:41.1397352Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 53.7 MB/s eta 0:00:00
2025-05-07T20:24:41.1398395Z Installing collected packages: pip
2025-05-07T20:24:41.1398835Z   Attempting uninstall: pip
2025-05-07T20:24:41.1399257Z     Found existing installation: pip 25.1
2025-05-07T20:24:41.1399682Z     Uninstalling pip-25.1:
2025-05-07T20:24:41.1400080Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:41.1400538Z Successfully installed pip-25.1.1
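Note the idiom used for the pip upgrade above: conda run -n <env> executes a command inside an environment without activating it in the calling shell, which suits non-login CI shells. A minimal sketch of the same pattern:

    # Run env-scoped commands without 'conda activate'
    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c "import sys; print(sys.version)"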
2025-05-07T20:24:41.2046469Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:41.2069319Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:42.0608434Z Channels:
2025-05-07T20:24:42.0608674Z  - conda-forge
2025-05-07T20:24:42.0608896Z Platform: linux-64
2025-05-07T20:24:52.4420500Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:54.1494446Z Solving environment: done
2025-05-07T20:24:54.2128158Z ## Package Plan ##
2025-05-07T20:24:54.2128619Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:54.2129033Z   added / updated specs:
2025-05-07T20:24:54.2129306Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:54.2129632Z The following packages will be downloaded:
2025-05-07T20:24:54.2129995Z     package                    |            build
2025-05-07T20:24:54.2130358Z     ---------------------------|-----------------
2025-05-07T20:24:54.2130799Z     cffi-1.17.1                |  py312h06ac9bb_0         288 KB  conda-forge
2025-05-07T20:24:54.2131246Z     cryptography-44.0.3        |  py312hda17c39_0         1.5 MB  conda-forge
2025-05-07T20:24:54.2131681Z     expat-2.7.0                |       h5888daf_0         137 KB  conda-forge
2025-05-07T20:24:54.2132087Z     libexpat-2.7.0             |       h5888daf_0          73 KB  conda-forge
2025-05-07T20:24:54.2132505Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:54.2132920Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:54.2133447Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:54.2133852Z     libnsl-2.0.1               |       hd590300_0          33 KB  conda-forge
2025-05-07T20:24:54.2134271Z     libsqlite-3.46.0           |       hde9e2c9_0         845 KB  conda-forge
2025-05-07T20:24:54.2134700Z     libuuid-2.38.1             |       h0b41bf4_0          33 KB  conda-forge
2025-05-07T20:24:54.2135108Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:54.2135666Z     libzlib-1.2.13             |       h4ab18f5_6          60 KB  conda-forge
2025-05-07T20:24:54.2136080Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:54.2136497Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:54.2136930Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:54.2137368Z     python-3.12.2              |hab00c5b_0_cpython        30.8 MB  conda-forge
2025-05-07T20:24:54.2137790Z     python_abi-3.12            |          7_cp312           7 KB  conda-forge
2025-05-07T20:24:54.2138244Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:54.2139113Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:54.2139552Z     zlib-1.2.13                |       h4ab18f5_6          91 KB  conda-forge
2025-05-07T20:24:54.2139933Z     ------------------------------------------------------------
2025-05-07T20:24:54.2140270Z                                            Total:        38.6 MB
2025-05-07T20:24:54.2140618Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:54.2141323Z   cffi               conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:24:54.2141915Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:24:54.2142411Z   libexpat           conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:24:54.2142843Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:54.2143266Z   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:24:54.2145863Z   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:24:54.2146369Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:54.2146919Z   libzlib            conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:54.2147366Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:54.2147831Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:54.2148282Z   python_abi         conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:24:54.2148786Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:54.2149360Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:54.2149805Z The following packages will be UPDATED:
2025-05-07T20:24:54.2150501Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:54.2151486Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:54.2152234Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:54.2152958Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:24:54.2153676Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:54.2154363Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:54.2154995Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:54.2155630Z   expat              pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:24:54.2156338Z   python             pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:24:54.2156951Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:54.3157735Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:54.3250474Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:54.3529367Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:54.3806873Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:54.3811996Z libsqlite-3.46.0     | 845 KB    | ########## | 100%
2025-05-07T20:24:54.4110465Z cffi-1.17.1          | 288 KB    | ########## | 100%
2025-05-07T20:24:54.4203883Z expat-2.7.0          | 137 KB    | ########## | 100%
2025-05-07T20:24:54.4355262Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:54.4487729Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:54.4609405Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:54.4708706Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:54.4736721Z zlib-1.2.13          | 91 KB     | ########## | 100%
2025-05-07T20:24:54.5058699Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:54.5165285Z libzlib-1.2.13       | 60 KB     | ########## | 100%
2025-05-07T20:24:54.5289172Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:54.5390479Z libexpat-2.7.0       | 73 KB     | ########## | 100%
2025-05-07T20:24:54.5667267Z libuuid-2.38.1       | 33 KB     | ########## | 100%
2025-05-07T20:24:54.5739271Z libnsl-2.0.1         | 33 KB     | ########## | 100%
2025-05-07T20:24:54.5788476Z ... (more hidden) ...
2025-05-07T20:24:55.0864370Z python-3.12.2        | 30.8 MB   | ########6  |  86%
2025-05-07T20:24:55.1426859Z 2025-05-07T20:24:55.1426864Z 2025-05-07T20:24:55.1426869Z 2025-05-07T20:24:55.1426875Z 2025-05-07T20:24:55.1427023Z 2025-05-07T20:24:55.1649689Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1650096Z 2025-05-07T20:24:55.1650101Z 2025-05-07T20:24:55.1650107Z 2025-05-07T20:24:55.1650112Z 2025-05-07T20:24:55.1650132Z 2025-05-07T20:24:55.1650138Z 2025-05-07T20:24:55.1650143Z 2025-05-07T20:24:55.1650148Z 2025-05-07T20:24:55.1650153Z 2025-05-07T20:24:55.1650159Z 2025-05-07T20:24:55.1650164Z 2025-05-07T20:24:55.1650181Z 2025-05-07T20:24:55.1650186Z 2025-05-07T20:24:55.1650191Z 2025-05-07T20:24:55.1650195Z 2025-05-07T20:24:55.1650201Z 2025-05-07T20:24:55.1650206Z 2025-05-07T20:24:55.1650211Z 2025-05-07T20:24:55.1663339Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1663769Z 2025-05-07T20:24:55.1663774Z 2025-05-07T20:24:55.1663780Z 2025-05-07T20:24:55.1663785Z 2025-05-07T20:24:55.1663790Z 2025-05-07T20:24:55.1663795Z 2025-05-07T20:24:55.1663800Z 2025-05-07T20:24:55.1663805Z 2025-05-07T20:24:55.1663810Z 2025-05-07T20:24:55.1663815Z 2025-05-07T20:24:55.1663820Z 2025-05-07T20:24:55.1663825Z 2025-05-07T20:24:55.1663830Z 2025-05-07T20:24:55.1663835Z 2025-05-07T20:24:55.1663840Z 2025-05-07T20:24:55.1663845Z 2025-05-07T20:24:55.1663855Z 2025-05-07T20:24:55.1665762Z 2025-05-07T20:24:55.1704600Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1705008Z 2025-05-07T20:24:55.1705014Z 2025-05-07T20:24:55.1705019Z 2025-05-07T20:24:55.1705024Z 2025-05-07T20:24:55.1705042Z 2025-05-07T20:24:55.1705047Z 2025-05-07T20:24:55.1705052Z 2025-05-07T20:24:55.1705057Z 2025-05-07T20:24:55.1705062Z 2025-05-07T20:24:55.1705067Z 2025-05-07T20:24:55.1705072Z 2025-05-07T20:24:55.1705086Z 2025-05-07T20:24:55.1705091Z 2025-05-07T20:24:55.1705096Z 2025-05-07T20:24:55.1705101Z 2025-05-07T20:24:55.1705106Z 2025-05-07T20:24:55.1705111Z 2025-05-07T20:24:55.1705116Z 2025-05-07T20:24:55.1705122Z 2025-05-07T20:24:55.1714016Z ... (more hidden) ... 
2025-05-07T20:24:55.1714396Z 2025-05-07T20:24:55.1714402Z 2025-05-07T20:24:55.1714407Z 2025-05-07T20:24:55.1714412Z 2025-05-07T20:24:55.1714645Z 2025-05-07T20:24:55.1714651Z 2025-05-07T20:24:55.1714656Z 2025-05-07T20:24:55.1714661Z 2025-05-07T20:24:55.1714666Z 2025-05-07T20:24:55.1714671Z 2025-05-07T20:24:55.1714676Z 2025-05-07T20:24:55.1714682Z 2025-05-07T20:24:55.1714687Z 2025-05-07T20:24:55.1714700Z 2025-05-07T20:24:55.1714706Z 2025-05-07T20:24:55.1714711Z 2025-05-07T20:24:55.1715635Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:55.1716025Z 2025-05-07T20:24:55.1716031Z 2025-05-07T20:24:55.1716231Z 2025-05-07T20:24:55.1716237Z 2025-05-07T20:24:55.1716242Z 2025-05-07T20:24:55.1716247Z 2025-05-07T20:24:55.1716252Z 2025-05-07T20:24:55.1716257Z 2025-05-07T20:24:55.1716262Z 2025-05-07T20:24:55.1716267Z 2025-05-07T20:24:55.1716273Z 2025-05-07T20:24:55.1716284Z 2025-05-07T20:24:55.1716289Z 2025-05-07T20:24:55.1716295Z 2025-05-07T20:24:55.1716300Z 2025-05-07T20:24:55.1716305Z 2025-05-07T20:24:55.1992031Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:55.1992618Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8902408Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8908626Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8908969Z 2025-05-07T20:24:55.8908975Z 2025-05-07T20:24:55.8908980Z 2025-05-07T20:24:55.8908986Z 2025-05-07T20:24:55.8908992Z 2025-05-07T20:24:55.8908999Z 2025-05-07T20:24:55.8909005Z 2025-05-07T20:24:55.8909026Z 2025-05-07T20:24:55.8909031Z 2025-05-07T20:24:55.8909037Z 2025-05-07T20:24:55.8909043Z 2025-05-07T20:24:55.8909058Z 2025-05-07T20:24:55.8909063Z 2025-05-07T20:24:55.8909068Z 2025-05-07T20:24:55.8909072Z 2025-05-07T20:24:55.8909077Z 2025-05-07T20:24:55.8909082Z 2025-05-07T20:24:55.8909087Z 2025-05-07T20:24:55.8909092Z 2025-05-07T20:24:55.8909212Z 2025-05-07T20:24:55.8909693Z  2025-05-07T20:24:55.8910141Z 2025-05-07T20:24:55.8910407Z 2025-05-07T20:24:55.8910641Z  2025-05-07T20:24:55.8910919Z 2025-05-07T20:24:55.8910925Z 2025-05-07T20:24:55.8911170Z  2025-05-07T20:24:55.8911453Z 2025-05-07T20:24:55.8911459Z 2025-05-07T20:24:55.8911465Z 2025-05-07T20:24:55.8911696Z  2025-05-07T20:24:55.8911996Z 2025-05-07T20:24:55.8912002Z 2025-05-07T20:24:55.8912007Z 2025-05-07T20:24:55.8912012Z 2025-05-07T20:24:55.8912248Z  2025-05-07T20:24:55.8912539Z 2025-05-07T20:24:55.8912544Z 2025-05-07T20:24:55.8912550Z 2025-05-07T20:24:55.8912555Z 2025-05-07T20:24:55.8912560Z 2025-05-07T20:24:55.8912793Z  2025-05-07T20:24:55.8913084Z 2025-05-07T20:24:55.8913097Z 2025-05-07T20:24:55.8913102Z 2025-05-07T20:24:55.8913107Z 2025-05-07T20:24:55.8913112Z 2025-05-07T20:24:55.8913117Z 2025-05-07T20:24:55.8913358Z  2025-05-07T20:24:55.8913656Z 2025-05-07T20:24:55.8913661Z 2025-05-07T20:24:55.8913666Z 2025-05-07T20:24:55.8913672Z 2025-05-07T20:24:55.8913677Z 2025-05-07T20:24:55.8913682Z 2025-05-07T20:24:55.8913687Z 2025-05-07T20:24:55.8913958Z  2025-05-07T20:24:55.8914270Z 2025-05-07T20:24:55.8914275Z 2025-05-07T20:24:55.8914280Z 2025-05-07T20:24:55.8914285Z 2025-05-07T20:24:55.8914291Z 2025-05-07T20:24:55.8914296Z 2025-05-07T20:24:55.8914301Z 2025-05-07T20:24:55.8914306Z 2025-05-07T20:24:55.8914567Z  2025-05-07T20:24:55.8914861Z 2025-05-07T20:24:55.8914866Z 2025-05-07T20:24:55.8914871Z 2025-05-07T20:24:55.8914876Z 2025-05-07T20:24:55.8915105Z 2025-05-07T20:24:55.8915112Z 2025-05-07T20:24:55.8915116Z 2025-05-07T20:24:55.8915121Z 2025-05-07T20:24:55.8915126Z 2025-05-07T20:24:55.8915391Z  2025-05-07T20:24:55.8915615Z 
2025-05-07T20:24:55.8915618Z 2025-05-07T20:24:55.8915622Z 2025-05-07T20:24:55.8915625Z 2025-05-07T20:24:55.8915629Z 2025-05-07T20:24:55.8915632Z 2025-05-07T20:24:55.8915636Z 2025-05-07T20:24:55.8915639Z 2025-05-07T20:24:55.8915643Z 2025-05-07T20:24:55.8915775Z 2025-05-07T20:24:55.8915978Z  2025-05-07T20:24:55.8916205Z 2025-05-07T20:24:55.8916209Z 2025-05-07T20:24:55.8916212Z 2025-05-07T20:24:55.8916216Z 2025-05-07T20:24:55.8916219Z 2025-05-07T20:24:55.8916223Z 2025-05-07T20:24:55.8916226Z 2025-05-07T20:24:55.8916230Z 2025-05-07T20:24:55.8916239Z 2025-05-07T20:24:55.8916243Z 2025-05-07T20:24:55.8916246Z 2025-05-07T20:24:55.8916444Z  2025-05-07T20:24:55.8916663Z 2025-05-07T20:24:55.8916667Z 2025-05-07T20:24:55.8916671Z 2025-05-07T20:24:55.8916674Z 2025-05-07T20:24:55.8916688Z 2025-05-07T20:24:55.8916692Z 2025-05-07T20:24:55.8916696Z 2025-05-07T20:24:55.8916699Z 2025-05-07T20:24:55.8916703Z 2025-05-07T20:24:55.8916706Z 2025-05-07T20:24:55.8916710Z 2025-05-07T20:24:55.8916713Z 2025-05-07T20:24:55.8916907Z  2025-05-07T20:24:55.8917142Z 2025-05-07T20:24:55.8917145Z 2025-05-07T20:24:55.8917149Z 2025-05-07T20:24:55.8917152Z 2025-05-07T20:24:55.8917156Z 2025-05-07T20:24:55.8917160Z 2025-05-07T20:24:55.8917163Z 2025-05-07T20:24:55.8917167Z 2025-05-07T20:24:55.8917170Z 2025-05-07T20:24:55.8917174Z 2025-05-07T20:24:55.8917177Z 2025-05-07T20:24:55.8917181Z 2025-05-07T20:24:55.8917184Z 2025-05-07T20:24:55.8917387Z  2025-05-07T20:24:55.8917617Z 2025-05-07T20:24:55.8917621Z 2025-05-07T20:24:55.8917624Z 2025-05-07T20:24:55.8917628Z 2025-05-07T20:24:55.8917631Z 2025-05-07T20:24:55.8917635Z 2025-05-07T20:24:55.8917639Z 2025-05-07T20:24:55.8917642Z 2025-05-07T20:24:55.8917646Z 2025-05-07T20:24:55.8917649Z 2025-05-07T20:24:55.8917653Z 2025-05-07T20:24:55.8917657Z 2025-05-07T20:24:55.8917660Z 2025-05-07T20:24:55.8917664Z 2025-05-07T20:24:55.8917874Z  2025-05-07T20:24:55.8918104Z 2025-05-07T20:24:55.8918108Z 2025-05-07T20:24:55.8918112Z 2025-05-07T20:24:55.8918115Z 2025-05-07T20:24:55.8918119Z 2025-05-07T20:24:55.8918123Z 2025-05-07T20:24:55.8918126Z 2025-05-07T20:24:55.8918130Z 2025-05-07T20:24:55.8918139Z 2025-05-07T20:24:55.8918143Z 2025-05-07T20:24:55.8918146Z 2025-05-07T20:24:55.8918150Z 2025-05-07T20:24:55.8918154Z 2025-05-07T20:24:55.8918157Z 2025-05-07T20:24:55.8918166Z 2025-05-07T20:24:55.8918452Z  2025-05-07T20:24:55.8918738Z 2025-05-07T20:24:55.8918743Z 2025-05-07T20:24:55.8918747Z 2025-05-07T20:24:55.8918752Z 2025-05-07T20:24:55.8918756Z 2025-05-07T20:24:55.8918761Z 2025-05-07T20:24:55.8918765Z 2025-05-07T20:24:55.8918770Z 2025-05-07T20:24:55.8918774Z 2025-05-07T20:24:55.8918779Z 2025-05-07T20:24:55.8918783Z 2025-05-07T20:24:55.8918796Z 2025-05-07T20:24:55.8918806Z 2025-05-07T20:24:55.8918810Z 2025-05-07T20:24:55.8918815Z 2025-05-07T20:24:55.8918819Z 2025-05-07T20:24:55.8919083Z  2025-05-07T20:24:55.8919376Z 2025-05-07T20:24:55.8919380Z 2025-05-07T20:24:55.8919392Z 2025-05-07T20:24:55.8919396Z 2025-05-07T20:24:55.8919401Z 2025-05-07T20:24:55.8919405Z 2025-05-07T20:24:55.8919410Z 2025-05-07T20:24:55.8919414Z 2025-05-07T20:24:55.8919533Z 2025-05-07T20:24:55.8919539Z 2025-05-07T20:24:55.8919543Z 2025-05-07T20:24:55.8919548Z 2025-05-07T20:24:55.8919552Z 2025-05-07T20:24:55.8919557Z 2025-05-07T20:24:55.8919561Z 2025-05-07T20:24:55.8919566Z 2025-05-07T20:24:55.8919570Z 2025-05-07T20:24:55.8919845Z  2025-05-07T20:24:55.8920145Z 2025-05-07T20:24:55.8920150Z 2025-05-07T20:24:55.8920154Z 2025-05-07T20:24:55.8920159Z 2025-05-07T20:24:55.8920255Z 2025-05-07T20:24:55.8920260Z 2025-05-07T20:24:55.8920264Z 
2025-05-07T20:24:55.8921015Z done
2025-05-07T20:24:55.9920019Z Preparing transaction: done
2025-05-07T20:24:56.7516922Z Verifying transaction: done
2025-05-07T20:24:58.3548861Z Executing transaction: done
2025-05-07T20:24:58.7080937Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:00.4524721Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:00.4537208Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:00.4560602Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:01.3202449Z Channels:
2025-05-07T20:25:01.3202694Z  - conda-forge
2025-05-07T20:25:01.3202938Z Platform: linux-64
2025-05-07T20:25:04.7570057Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:05.1247920Z Solving environment: done
2025-05-07T20:25:05.1614139Z # All requested packages already installed.
2025-05-07T20:25:08.5298721Z [SETUP] Copying over ...
2025-05-07T20:25:08.5300067Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:25:10.1700784Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:25:10.1701239Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:10.1734615Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:10.1735080Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:10.1748807Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:10.1749158Z env:
2025-05-07T20:25:10.1749388Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:10.1749682Z   BUILD_ENV: build_binary
2025-05-07T20:25:10.1749933Z   BUILD_TARGET: genai
2025-05-07T20:25:10.1750162Z   BUILD_VARIANT: cuda
2025-05-07T20:25:10.1750395Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:10.1750652Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:10.1750956Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:10.1751280Z ##[endgroup]
2025-05-07T20:25:10.5085903Z ################################################################################
2025-05-07T20:25:10.5086257Z # Install C/C++ Compilers
2025-05-07T20:25:10.5086504Z #
2025-05-07T20:25:10.5102769Z # [2025-05-07T20:25:10.509Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:10.5103174Z ################################################################################
2025-05-07T20:25:10.5120023Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:10.6161933Z [CHECK] Network does not appear to be blocked.
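The "[EXEC] [ATTEMPT 0/3]" prefixes above appear to come from a retry helper in .github/scripts/setup_env.bash. A minimal bash sketch of that pattern, assuming a hypothetical helper name run_with_retries (not the actual implementation):

    # Hypothetical retry wrapper; mirrors the "[EXEC] [ATTEMPT n/3]" log format.
    run_with_retries () {
      local max_retries=3 attempt=0
      while [ "$attempt" -le "$max_retries" ]; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        # Run the command; stop retrying as soon as it succeeds.
        if "$@"; then
          return 0
        fi
        attempt=$((attempt + 1))
        sleep 2
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    }

    # Usage mirroring the network check above:
    run_with_retries wget -q --timeout 1 pypi.org -O /dev/null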
2025-05-07T20:25:10.6172557Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:10.6195137Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:11.4878798Z Channels:
2025-05-07T20:25:11.4879422Z  - conda-forge
2025-05-07T20:25:11.4879977Z Platform: linux-64
2025-05-07T20:25:14.8256426Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:15.1964556Z Solving environment: done
2025-05-07T20:25:15.2605110Z ## Package Plan ##
2025-05-07T20:25:15.2605838Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.2606285Z   added / updated specs:
2025-05-07T20:25:15.2606552Z     - sysroot_linux-64=2.17
2025-05-07T20:25:15.2606859Z The following packages will be downloaded:
2025-05-07T20:25:15.2607196Z     package                        |            build
2025-05-07T20:25:15.2607514Z     -------------------------------|-----------------
2025-05-07T20:25:15.2607930Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:25:15.2608418Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:25:15.2608827Z     ------------------------------------------------------------
2025-05-07T20:25:15.2609164Z                                                Total:        15.4 MB
2025-05-07T20:25:15.2609503Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:15.2610020Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:15.2610576Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:15.2611031Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:15.5847843Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:16.3269428Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:16.3273916Z done
2025-05-07T20:25:16.4277356Z Preparing transaction: done
2025-05-07T20:25:16.6288437Z Verifying transaction: done
2025-05-07T20:25:16.8368070Z Executing transaction: done
2025-05-07T20:25:16.9922442Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:16.9922773Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:18.6663299Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
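The LD_LIBRARY_PATH / libstdc++ lines above are environment sanity checks. A bash sketch of an equivalent check, assuming the environment prefix shown in the log (not the setup_env.bash original):

    # Verify that the conda env ships its own libstdc++ and report how it resolves.
    env_prefix=/home/ec2-user/miniconda/envs/build_binary
    lib="${env_prefix}/lib/libstdc++.so.6"
    if [ -L "$lib" ]; then
      echo "[CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): $lib"
      readlink -f "$lib"   # the concrete versioned library the symlink points to
    else
      echo "[CHECK] libstdc++.so.6 missing or not a symlink under ${env_prefix}/lib" >&2
    fi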
2025-05-07T20:25:18.6675879Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:18.6697718Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:19.5603611Z Channels:
2025-05-07T20:25:19.5603854Z  - conda-forge
2025-05-07T20:25:19.5604088Z Platform: linux-64
2025-05-07T20:25:22.8347955Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:23.7957870Z Solving environment: done
2025-05-07T20:25:23.8616096Z ## Package Plan ##
2025-05-07T20:25:23.8616819Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:23.8617618Z   added / updated specs:
2025-05-07T20:25:23.8618178Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:23.8619268Z The following packages will be downloaded:
2025-05-07T20:25:23.8619763Z     package                         |            build
2025-05-07T20:25:23.8620111Z     --------------------------------|-----------------
2025-05-07T20:25:23.8620512Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:25:23.8620998Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:25:23.8621461Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:25:23.8621904Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:25:23.8622341Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:25:23.8622780Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:25:23.8623215Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:25:23.8623695Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:25:23.8624165Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:25:23.8624607Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:25:23.8625082Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:25:23.8625554Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:25:23.8625967Z     ------------------------------------------------------------
2025-05-07T20:25:23.8626317Z                                                Total:        91.6 MB
2025-05-07T20:25:23.8626668Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:23.8627163Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:23.8627922Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:23.8628460Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:23.8628964Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:23.8629457Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:23.8629957Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:23.8630482Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:23.8631037Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:23.8631524Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:23.8632061Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:23.8632543Z The following packages will be UPDATED:
2025-05-07T20:25:23.8633069Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:23.8633771Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:23.8634327Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:24.3235661Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:24.6813017Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:24.8041325Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:24.8246632Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:24.8921944Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:24.8953246Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:24.9313237Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:24.9807161Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:25.1808972Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:25.3144745Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:26.0644185Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:26.0650803Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:25:26.0656906Z done
2025-05-07T20:25:26.1658548Z Preparing transaction: done
2025-05-07T20:25:26.4673071Z Verifying transaction: done
2025-05-07T20:25:26.5683105Z Executing transaction: done
2025-05-07T20:25:26.7318417Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:30.6396238Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:30.6426521Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:30.6456616Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:30.6486035Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:32.5385251Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:32.6008218Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:34.4813660Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:34.5436393Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:36.4274110Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:36.4908194Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:38.3803030Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:38.4432130Z [CHECK] Binary g++ found in PATH
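The four ln -sf calls above point cc/gcc/c++/g++ at the conda cross-toolchain binaries. A quick way to confirm the links resolve as intended (an illustrative bash sketch; the paths are taken from the log, the loop itself is an assumption):

    bin=/home/ec2-user/miniconda/envs/build_binary/bin
    for tool in cc gcc c++ g++; do
      # Each alias should resolve to an x86_64-conda-linux-gnu-* binary.
      printf '%s -> %s\n' "${bin}/${tool}" "$(readlink -f "${bin}/${tool}")"
    done
    "${bin}/cc" --version | head -n 1   # should report gcc 11.4.0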
2025-05-07T20:25:38.4436266Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:38.4436686Z + conda run -n build_binary cc -dM -E -
#define __DBL_MIN_EXP__ (-1021)
#define __UINT_LEAST16_MAX__ 0xffff
#define __ATOMIC_ACQUIRE 2
#define __FLT128_MAX_10_EXP__ 4932
#define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
#define __GCC_IEC_559_COMPLEX 2
#define __UINT_LEAST8_TYPE__ unsigned char
#define __SIZEOF_FLOAT80__ 16
#define __INTMAX_C(c) c ## L
#define __CHAR_BIT__ 8
#define __UINT8_MAX__ 0xff
#define __SCHAR_WIDTH__ 8
#define __WINT_MAX__ 0xffffffffU
#define __FLT32_MIN_EXP__ (-125)
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __SIZE_MAX__ 0xffffffffffffffffUL
#define __WCHAR_MAX__ 0x7fffffff
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
#define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L)
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
#define __GCC_ATOMIC_CHAR_LOCK_FREE 2
#define __GCC_IEC_559 2
#define __FLT32X_DECIMAL_DIG__ 17
#define __FLT_EVAL_METHOD__ 0
#define __FLT64_DECIMAL_DIG__ 17
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
#define __UINT_FAST64_MAX__ 0xffffffffffffffffUL
#define __SIG_ATOMIC_TYPE__ int
#define __DBL_MIN_10_EXP__ (-307)
#define __FINITE_MATH_ONLY__ 0
#define __FLT32X_MAX_EXP__ 1024
#define __FLT32_HAS_DENORM__ 1
#define __UINT_FAST8_MAX__ 0xff
#define __FLT32_MAX_10_EXP__ 38
#define __DEC64_MAX_EXP__ 385
#define __INT8_C(c) c
#define __INT_LEAST8_WIDTH__ 8
#define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL
#define __SHRT_MAX__ 0x7fff
#define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L
#define __FLT64X_MAX_10_EXP__ 4932
#define __LDBL_IS_IEC_60559__ 2
#define __FLT64X_HAS_QUIET_NAN__ 1
#define __UINT_LEAST8_MAX__ 0xff
#define __GCC_ATOMIC_BOOL_LOCK_FREE 2
#define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128
#define __UINTMAX_TYPE__ long unsigned int
#define __linux 1
#define __DEC32_EPSILON__ 1E-6DF
#define __FLT_EVAL_METHOD_TS_18661_3__ 0
#define __unix 1
#define __UINT32_MAX__ 0xffffffffU
#define __FLT128_MIN_EXP__ (-16381)
#define __WINT_MIN__ 0U
#define __FLT128_MIN_10_EXP__ (-4931)
#define __FLT32X_IS_IEC_60559__ 2
#define __INT_LEAST16_WIDTH__ 16
#define __SCHAR_MAX__ 0x7f
#define __FLT128_MANT_DIG__ 113
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __INT64_C(c) c ## L
#define __GCC_ATOMIC_POINTER_LOCK_FREE 2
#define __FLT32X_MANT_DIG__ 53
#define __USER_LABEL_PREFIX__
#define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x
#define __STDC_HOSTED__ 1
#define __DEC64_MIN_EXP__ (-382)
#define __DBL_DIG__ 15
#define __FLT32_DIG__ 6
#define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F
#define __SHRT_WIDTH__ 16
#define __FLT32_IS_IEC_60559__ 2
#define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L
#define __STDC_UTF_16__ 1
#define __DBL_IS_IEC_60559__ 2
#define __DEC32_MAX__ 9.999999E96DF
#define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x
#define __FLT32X_HAS_INFINITY__ 1
#define __INT32_MAX__ 0x7fffffff
#define __unix__ 1
#define __INT_WIDTH__ 32
#define __SIZEOF_LONG__ 8
#define __STDC_IEC_559__ 1
#define __STDC_ISO_10646__ 201103L
#define __UINT16_C(c) c
#define __DECIMAL_DIG__ 21
#define __STDC_IEC_559_COMPLEX__ 1
#define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64
#define __gnu_linux__ 1
#define __FLT128_IS_IEC_60559__ 2
#define __FLT64X_MIN_10_EXP__ (-4931)
#define __LDBL_HAS_QUIET_NAN__ 1
#define __FLT64_MANT_DIG__ 53
#define __FLT64X_MANT_DIG__ 64
#define __GNUC__ 11
#define __pie__ 2
#define __MMX__ 1
#define __FLT_HAS_DENORM__ 1
#define __SIZEOF_LONG_DOUBLE__ 16
#define __BIGGEST_ALIGNMENT__ 16
#define __FLT64_MAX_10_EXP__ 308
#define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __INT_FAST32_MAX__ 0x7fffffffffffffffL
#define __DBL_HAS_INFINITY__ 1
#define __SIZEOF_FLOAT__ 4
#define __HAVE_SPECULATION_SAFE_VALUE 1
#define __DEC32_MIN_EXP__ (-94)
#define __INTPTR_WIDTH__ 64
#define __FLT64X_HAS_INFINITY__ 1
#define __UINT_LEAST32_MAX__ 0xffffffffU
#define __FLT32X_HAS_DENORM__ 1
#define __INT_FAST16_TYPE__ long int
#define __MMX_WITH_SSE__ 1
#define __LDBL_HAS_DENORM__ 1
#define __FLT128_HAS_INFINITY__ 1
#define __DEC32_MIN__ 1E-95DF
#define __DBL_MAX_EXP__ 1024
#define __WCHAR_WIDTH__ 32
#define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __DEC128_EPSILON__ 1E-33DL
#define __SSE2_MATH__ 1
#define __ATOMIC_HLE_RELEASE 131072
#define __PTRDIFF_MAX__ 0x7fffffffffffffffL
#define __amd64 1
#define __STDC_NO_THREADS__ 1
#define __ATOMIC_HLE_ACQUIRE 65536
#define __LONG_LONG_MAX__ 0x7fffffffffffffffLL
#define __SIZEOF_SIZE_T__ 8
#define __FLT64X_MIN_EXP__ (-16381)
#define __SIZEOF_WINT_T__ 4
#define __LONG_LONG_WIDTH__ 64
#define __FLT32_MAX_EXP__ 128
#define __GXX_ABI_VERSION 1016
#define __FLT_MIN_EXP__ (-125)
#define __GCC_HAVE_DWARF2_CFI_ASM 1
#define __INT16_MAX__ 0x7fff
#define __x86_64 1
#define __INT_FAST64_TYPE__ long int
#define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64
#define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L)
#define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128
#define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIZEOF_POINTER__ 8
#define __LP64__ 1
#define __DBL_HAS_QUIET_NAN__ 1
#define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x
#define __DECIMAL_BID_FORMAT__ 1
#define __FLT64_MIN_EXP__ (-1021)
#define __FLT64_MIN_10_EXP__ (-307)
#define __FLT64X_DECIMAL_DIG__ 21
#define __DEC128_MIN__ 1E-6143DL
#define __REGISTER_PREFIX__
#define __UINT16_MAX__ 0xffff
#define __DBL_HAS_DENORM__ 1
#define __LDBL_HAS_INFINITY__ 1
#define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32
#define __UINT8_TYPE__ unsigned char
#define __FLT_DIG__ 6
#define __NO_INLINE__ 1
#define __DEC_EVAL_METHOD__ 2
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL
#define __FLT_MANT_DIG__ 24
#define __LDBL_DECIMAL_DIG__ 21
#define __VERSION__ "11.4.0"
#define __UINT64_C(c) c ## UL
#define _STDC_PREDEF_H 1
#define __INT_LEAST32_MAX__ 0x7fffffff
#define __GCC_ATOMIC_INT_LOCK_FREE 2
#define __FLT128_MAX_EXP__ 16384
#define __FLT32_MANT_DIG__ 24
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __FLT128_HAS_DENORM__ 1
#define __FLT32_DECIMAL_DIG__ 9
#define __FLT128_DIG__ 33
#define __INT32_C(c) c
#define __DEC64_EPSILON__ 1E-15DD
#define __ORDER_PDP_ENDIAN__ 3412
#define __DEC128_MIN_EXP__ (-6142)
#define __INT_FAST32_TYPE__ long int
#define __UINT_LEAST16_TYPE__ short unsigned int
#define unix 1
#define __SIZE_TYPE__ long unsigned int
#define __UINT64_MAX__ 0xffffffffffffffffUL
#define __FLT_IS_IEC_60559__ 2
#define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE"
#define __FLT64X_DIG__ 18
#define __INT8_TYPE__ signed char
#define __ELF__ 1
#define __GCC_ASM_FLAG_OUTPUTS__ 1
#define __UINT32_TYPE__ unsigned int
#define __FLT_RADIX__ 2
#define __INT_LEAST16_TYPE__ short int
#define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L
#define __UINTMAX_C(c) c ## UL
#define __SSE_MATH__ 1
#define __k8 1
#define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x
#define __SIG_ATOMIC_MAX__ 0x7fffffff
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __SIZEOF_PTRDIFF_T__ 8
#define __LDBL_DIG__ 18
#define __FLT64_IS_IEC_60559__ 2
#define __x86_64__ 1
#define __FLT32X_MIN_EXP__ (-1021)
#define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF
#define __INT_FAST16_MAX__ 0x7fffffffffffffffL
#define __FLT64_DIG__ 15
#define __UINT_FAST32_MAX__ 0xffffffffffffffffUL
#define __UINT_LEAST64_TYPE__ long unsigned int
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MAX_10_EXP__ 38
#define __LONG_MAX__ 0x7fffffffffffffffL
#define __FLT64X_HAS_DENORM__ 1
#define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL
#define __FLT_HAS_INFINITY__ 1
#define __GNUC_EXECUTION_CHARSET_NAME "UTF-8"
#define __UINT_FAST16_TYPE__ long unsigned int
#define __DEC64_MAX__ 9.999999999999999E384DD
#define __INT_FAST32_WIDTH__ 64
#define __CHAR16_TYPE__ short unsigned int
#define __PRAGMA_REDEFINE_EXTNAME 1
#define __SIZE_WIDTH__ 64
#define __SEG_FS 1
#define __INT_LEAST16_MAX__ 0x7fff
#define __DEC64_MANT_DIG__ 16
#define __INT64_MAX__ 0x7fffffffffffffffL
#define __SEG_GS 1
#define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32
#define __SIG_ATOMIC_WIDTH__ 32
#define __INT_LEAST64_TYPE__ long int
#define __INT16_TYPE__ short int
#define __INT_LEAST8_TYPE__ signed char
#define __STDC_VERSION__ 201710L
#define __SIZEOF_INT__ 4
#define __DEC32_MAX_EXP__ 97
#define __INT_FAST8_MAX__ 0x7f
#define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __INTPTR_MAX__ 0x7fffffffffffffffL
#define linux 1
#define __FLT64_HAS_QUIET_NAN__ 1
#define __FLT32_MIN_10_EXP__ (-37)
#define __FLT32X_DIG__ 15
#define __PTRDIFF_WIDTH__ 64
#define __LDBL_MANT_DIG__ 64
#define __FLT64_HAS_INFINITY__ 1
#define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1)
#define __code_model_small__ 1
#define __GCC_ATOMIC_LONG_LOCK_FREE 2
#define __DEC32_MANT_DIG__ 7
#define __k8__ 1
#define __INTPTR_TYPE__ long int
#define __UINT16_TYPE__ short unsigned int
#define __WCHAR_TYPE__ int
#define __pic__ 2
#define __UINTPTR_MAX__ 0xffffffffffffffffUL
#define __INT_FAST64_WIDTH__ 64
#define __INT_FAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F
#define __FLT32_HAS_INFINITY__ 1
#define __FLT64X_MAX_EXP__ 16384
#define __UINT_FAST64_TYPE__ long unsigned int
#define __INT_MAX__ 0x7fffffff
#define __linux__ 1
#define __INT64_TYPE__ long int
#define __FLT_MAX_EXP__ 128
#define __ORDER_BIG_ENDIAN__ 4321
#define __DBL_MANT_DIG__ 53
#define __SIZEOF_FLOAT128__ 16
#define __INT_LEAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2
#define __DEC64_MIN__ 1E-383DD
#define __WINT_TYPE__ unsigned int
#define __UINT_LEAST32_TYPE__ unsigned int
#define __SIZEOF_SHORT__ 2
#define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __SSE__ 1
#define __LDBL_MIN_EXP__ (-16381)
#define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64
#define __amd64__ 1
#define __WINT_WIDTH__ 32
#define __INT_LEAST8_MAX__ 0x7f
#define __INT_LEAST64_WIDTH__ 64
#define __LDBL_MAX_EXP__ 16384
#define __FLT32X_MAX_10_EXP__ 308
#define __SIZEOF_INT128__ 16
#define __FLT64X_IS_IEC_60559__ 2
#define __LDBL_MAX_10_EXP__ 4932
#define __ATOMIC_RELAXED 0
#define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L)
#define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128
#define _LP64 1
#define __UINT8_C(c) c
#define __FLT64_MAX_EXP__ 1024
#define __INT_LEAST32_TYPE__ int
#define __SIZEOF_WCHAR_T__ 4
#define __UINT64_TYPE__ long unsigned int
#define __GNUC_PATCHLEVEL__ 0
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64
#define __FLT128_HAS_QUIET_NAN__ 1
#define __INTMAX_MAX__ 0x7fffffffffffffffL
#define __INT_FAST8_TYPE__ signed char
#define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x
#define __GNUC_STDC_INLINE__ 1
#define __FLT64_HAS_DENORM__ 1
#define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32
#define __DBL_DECIMAL_DIG__ 17
#define __STDC_UTF_32__ 1
#define __INT_FAST8_WIDTH__ 8
#define __FXSR__ 1
#define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x
#define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __INTMAX_WIDTH__ 64
#define __UINT32_C(c) c ## U
#define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F
#define __INT8_MAX__ 0x7f
#define __LONG_WIDTH__
64 2025-05-07T20:25:40.3464782Z #define __PIC__ 2 2025-05-07T20:25:40.3465033Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:40.3465423Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:40.3465797Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:40.3466127Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:40.3466451Z #define __SSE2__ 1 2025-05-07T20:25:40.3466671Z #define __INT32_TYPE__ int 2025-05-07T20:25:40.3466914Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:40.3467169Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:40.3467500Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:40.3467847Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:40.3468119Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:40.3468399Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:40.3468660Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:40.3468929Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:40.3469176Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:40.3469416Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:40.3469703Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:40.3469997Z #define __PIE__ 2 2025-05-07T20:25:40.3470310Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:40.3470698Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:40.3471043Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:40.3471402Z #define __INT16_C(c) c 2025-05-07T20:25:40.3471621Z #define __STDC__ 1 2025-05-07T20:25:40.3471852Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:40.3472122Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:40.3472374Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:40.3472678Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:40.3473224Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:40.3473543Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:40.3473807Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:40.3474081Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:40.3474337Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:40.3474619Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:40.3474905Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:40.3475167Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:40.3475463Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:40.3475854Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:40.3476272Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:40.3476565Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:40.3476854Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:40.3477104Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:40.3477263Z 2025-05-07T20:25:40.3986707Z 2025-05-07T20:25:40.3987492Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
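The dumps above and below come from GCC's -dM -E mode, which stops after preprocessing and prints every macro the compiler predefines instead of compiling anything. A minimal sketch of the same inspection outside the CI prelude, assuming only the build_binary env used throughout this job:

# Dump every predefined macro for an empty C translation unit.
# -dM prints #define directives only; -E stops after preprocessing;
# reading from /dev/null supplies the empty input.
conda run -n build_binary cc -dM -E - < /dev/null | sort

# Filter for the macros that identify the compiler, target, and standard.
conda run -n build_binary cc -dM -E - < /dev/null \
  | grep -E '__(VERSION|STDC_VERSION|x86_64|linux)__'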
2025-05-07T20:25:40.3987965Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:40.3988195Z 
2025-05-07T20:25:42.2963065Z [... full C++ preprocessor define dump omitted: it mirrors the C dump (same GCC 11.4.0, x86-64 Linux target) and adds the C++-specific macros, notably __cplusplus 201703L, __GNUG__ 11, __EXCEPTIONS 1, __GXX_RTTI 1, __STDCPP_THREADS__ 1, and C++17 feature-test macros such as __cpp_if_constexpr 201606L, __cpp_structured_bindings 201606L, and __cpp_deduction_guides 201703L ...]
2025-05-07T20:25:42.3115338Z 
2025-05-07T20:25:42.3609385Z 
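The dump confirms that this toolchain defaults to C++17 (__cplusplus 201703L, with the matching __cpp_* feature-test macros). The same -dM -E technique can verify what an explicit -std flag would change before wiring it into a build; a sketch, assuming the same env:

# Compare __cplusplus under explicit -std flags.
# GCC 11.4 also accepts -std=c++20, and the reported value rises accordingly.
for std in c++17 c++20; do
  printf '%-6s -> ' "$std"
  conda run -n build_binary c++ -dM -E -x c++ "-std=$std" - < /dev/null \
    | grep __cplusplus
done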
2025-05-07T20:25:42.3610196Z + conda run -n build_binary c++ --version
2025-05-07T20:25:42.3610680Z 
2025-05-07T20:25:44.2435513Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:44.2436275Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:44.2437204Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:44.2437836Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:44.2438155Z 
2025-05-07T20:25:44.2438159Z 
2025-05-07T20:25:44.3051134Z 
2025-05-07T20:25:44.3051894Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:44.3052461Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:44.3052762Z 
2025-05-07T20:25:46.2569559Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:46.2572023Z 
2025-05-07T20:25:46.2572550Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:46.2573178Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:46.2573510Z 
2025-05-07T20:25:48.2096509Z #define __cplusplus 201703L
2025-05-07T20:25:48.2098575Z 
2025-05-07T20:25:48.2099210Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:48.2133741Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:48.2134166Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:48.2146515Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:48.2146856Z env:
2025-05-07T20:25:48.2147081Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:48.2147383Z   BUILD_ENV: build_binary
2025-05-07T20:25:48.2147620Z   BUILD_TARGET: genai
2025-05-07T20:25:48.2147846Z   BUILD_VARIANT: cuda
2025-05-07T20:25:48.2148079Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:48.2148336Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:48.2148629Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:48.2148952Z ##[endgroup]
2025-05-07T20:25:48.5507432Z ################################################################################
2025-05-07T20:25:48.5507779Z # Install CUDA
2025-05-07T20:25:48.5508001Z #
2025-05-07T20:25:48.5524454Z # [2025-05-07T20:25:48.552Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:48.5524853Z ################################################################################
2025-05-07T20:25:48.5525077Z 
2025-05-07T20:25:48.5541092Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:48.6437520Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:48.6438562Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:48.6444150Z + conda clean --packages --tarball -y
2025-05-07T20:25:48.6444367Z 
2025-05-07T20:25:49.5107378Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:49.5107838Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:49.5760394Z 
2025-05-07T20:25:49.5769204Z + conda clean --all -y
2025-05-07T20:25:49.5769438Z 
2025-05-07T20:25:50.2689632Z There are no unused tarball(s) to remove.
2025-05-07T20:25:50.2690338Z Will remove 1 index cache(s).
2025-05-07T20:25:50.2690976Z There are no unused package(s) to remove.
2025-05-07T20:25:50.2691652Z There are no tempfile(s) to remove.
2025-05-07T20:25:50.2692263Z There are no logfile(s) to remove.
2025-05-07T20:25:50.3332937Z 
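The [EXEC] [ATTEMPT 0/3] prefix indicates that setup_env.bash wraps network-dependent commands in a bounded retry loop. The helper itself is not shown in this log, so the following is a minimal sketch of the pattern only; the function name, the 3-attempt limit, and the backoff are illustrative assumptions:

# Hypothetical retry wrapper in the spirit of the [EXEC] [ATTEMPT n/3] lines.
exec_with_retries () {
  local max_attempts=3 attempt
  for ((attempt = 0; attempt < max_attempts; attempt++)); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0
    fi
    sleep $((2 ** attempt))   # 1s, 2s, 4s between attempts
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Example: the network probe from this step.
exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null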
2025-05-07T20:25:50.3346579Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:50.3370123Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:51.2503807Z Channels:
2025-05-07T20:25:51.2504068Z  - conda-forge
2025-05-07T20:25:51.2504310Z Platform: linux-64
2025-05-07T20:26:01.7726194Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:02.8630439Z Solving environment: done
2025-05-07T20:26:02.9387007Z 
2025-05-07T20:26:02.9387524Z ## Package Plan ##
2025-05-07T20:26:02.9387710Z 
2025-05-07T20:26:02.9387931Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:02.9388224Z 
2025-05-07T20:26:02.9388322Z   added / updated specs:
2025-05-07T20:26:02.9388574Z     - cuda=12.6.3
2025-05-07T20:26:02.9388707Z 
2025-05-07T20:26:02.9388877Z The following packages will be downloaded:
2025-05-07T20:26:02.9389089Z 
2025-05-07T20:26:02.9389204Z     package                    |            build
2025-05-07T20:26:02.9389539Z     ---------------------------|-----------------
2025-05-07T20:26:02.9390054Z     [... the largest items are shown below; the remaining entries (cuda-* dev/tools components plus X11, font, and system-library dependencies, all from conda-forge) are omitted ...]
2025-05-07T20:26:02.9392160Z     cuda-12.6.3                |       ha804496_0          26 KB  conda-forge
2025-05-07T20:26:02.9402359Z     cuda-nsight-12.6.77        |       h7938cbb_0       113.2 MB  conda-forge
2025-05-07T20:26:02.9402964Z     cuda-nvcc-12.6.85          |       hcdd1206_0          23 KB  conda-forge
2025-05-07T20:26:02.9405287Z     cuda-nvdisasm-12.6.77      |       hbd13f7d_1        47.6 MB  conda-forge
2025-05-07T20:26:02.9407057Z     cuda-nvrtc-12.6.85         |       hbd13f7d_0        17.3 MB  conda-forge
2025-05-07T20:26:02.9409743Z     cuda-nvvp-12.6.80          |       hbd13f7d_1       109.3 MB  conda-forge
2025-05-07T20:26:02.9412601Z     cuda-toolkit-12.6.3        |       ha804496_0          19 KB  conda-forge
2025-05-07T20:26:02.9421826Z     libcublas-12.6.4.1         |       h5888daf_1       256.2 MB  conda-forge
2025-05-07T20:26:02.9422694Z     libcufft-11.3.0.4          |       hbd13f7d_0       156.2 MB  conda-forge
2025-05-07T20:26:02.9424452Z     libcurand-10.3.7.77        |       hbd13f7d_0        39.9 MB  conda-forge
2025-05-07T20:26:02.9425348Z     libcusolver-11.7.1.2       |       h5888daf_1        95.8 MB  conda-forge
2025-05-07T20:26:02.9426249Z     libcusparse-12.5.4.2       |       hbd13f7d_0       118.6 MB  conda-forge
2025-05-07T20:26:02.9430673Z     libnpp-12.3.1.54           |       h5888daf_0        93.4 MB  conda-forge
2025-05-07T20:26:02.9438924Z     nsight-compute-2024.3.2.3  |       hb5ebaad_0       443.1 MB  conda-forge
2025-05-07T20:26:02.9453156Z     ------------------------------------------------------------
2025-05-07T20:26:02.9453611Z                                            Total:        1.61 GB
2025-05-07T20:26:02.9453821Z 
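Once this plan is applied, the install can be sanity-checked from inside the same environment. A sketch using standard conda and CUDA toolkit commands (nvcc arrives via the cuda-nvcc-tools package listed above):

# Confirm the toolkit is visible inside the build_binary env.
conda run -n build_binary nvcc --version   # reports "release 12.6" for this pin
conda run -n build_binary which nvcc       # should resolve under the env prefix

# Cross-check the conda-side record of what was installed.
conda list -n build_binary | grep -E '^cuda(-toolkit)? '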
2025-05-07T20:26:02.9463427Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9463944Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:26:02.9464433Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:26:02.9465004Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9465528Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:26:02.9466085Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:02.9466598Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:26:02.9467082Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:26:02.9467633Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:26:02.9468163Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:26:02.9468632Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:26:02.9469147Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:26:02.9469696Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:26:02.9470227Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:26:02.9470767Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:26:02.9471300Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9471809Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9472295Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:26:02.9481563Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9482151Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:26:02.9482846Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:26:02.9483611Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:26:02.9484135Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:02.9484683Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:26:02.9485217Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:26:02.9485716Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:26:02.9486193Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:26:02.9486700Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:26:02.9487266Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:26:02.9487804Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:26:02.9488342Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9488882Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:26:02.9489351Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:26:02.9489821Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:26:02.9490331Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:26:02.9490868Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 
2025-05-07T20:26:02.9491313Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:02.9491813Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:02.9492396Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:02.9493154Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:02.9493722Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:02.9494209Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:02.9494722Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:02.9495225Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:02.9495675Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:02.9496085Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:02.9496499Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4
2025-05-07T20:26:02.9496913Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:02.9497280Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:02.9497686Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:02.9498097Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:02.9498484Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:02.9498918Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1
2025-05-07T20:26:02.9499412Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1
2025-05-07T20:26:02.9499899Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0
2025-05-07T20:26:02.9500362Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0
2025-05-07T20:26:02.9500842Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4
2025-05-07T20:26:02.9501327Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4
2025-05-07T20:26:02.9501808Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0
2025-05-07T20:26:02.9502295Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0
2025-05-07T20:26:02.9502884Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1
2025-05-07T20:26:02.9503404Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1
2025-05-07T20:26:02.9503922Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0
2025-05-07T20:26:02.9504437Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0
2025-05-07T20:26:02.9504939Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:02.9505403Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:02.9505895Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:02.9506394Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:02.9506867Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:02.9507335Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:02.9507802Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:02.9508218Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:02.9508639Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0
2025-05-07T20:26:02.9509096Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0
2025-05-07T20:26:02.9509544Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:02.9510004Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0
2025-05-07T20:26:02.9510527Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0
2025-05-07T20:26:02.9511056Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0
2025-05-07T20:26:02.9511583Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0
2025-05-07T20:26:02.9512198Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0
2025-05-07T20:26:02.9512702Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0
2025-05-07T20:26:02.9513177Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:02.9513617Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:02.9514074Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:02.9514498Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:02.9514948Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:02.9515418Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:02.9515862Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:02.9516265Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:02.9516735Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0
2025-05-07T20:26:02.9517214Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:02.9517583Z nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:02.9517970Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:02.9518444Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:02.9518923Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:02.9519380Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:02.9519857Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:02.9520280Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:02.9520699Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:02.9521276Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:02.9521954Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:02.9522476Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:02.9523035Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:02.9523550Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:02.9524053Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:02.9524557Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:02.9525019Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:02.9525481Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:02.9525947Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:02.9526481Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:02.9527045Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:02.9527562Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:02.9528048Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:02.9528549Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:02.9529041Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:02.9529526Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:02.9530052Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:02.9530570Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:02.9531007Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:02.9531246Z
2025-05-07T20:26:02.9531507Z The following packages will be UPDATED:
2025-05-07T20:26:02.9531715Z
2025-05-07T20:26:02.9531877Z libsqlite 3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:02.9532283Z libzlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:02.9532657Z zlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:02.9532890Z
2025-05-07T20:26:02.9533225Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:02.9533534Z
2025-05-07T20:26:02.9533788Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:02.9534362Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:02.9534680Z
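[Editor's note] The SUPERSEDED entries above are version downgrades, not updates: with channel priority in effect, conda prefers any build from the higher-priority channel (here conda-forge) over a newer version from a lower-priority channel (pkgs/main), which is why sqlite moves from 3.45.3 to 3.32.3 and tk from 8.6.14 to 8.6.13. A minimal sketch of how such a preference is typically configured follows; this is an illustrative assumption, since the runner's actual conda configuration is not shown in this log:

    # Hypothetical configuration sketch -- not taken from this log.
    # Make conda-forge the highest-priority channel and enforce strict
    # priority, so conda-forge builds supersede pkgs/main builds even
    # when the pkgs/main version number is higher.
    conda config --add channels conda-forge
    conda config --set channel_priority strict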
2025-05-07T20:26:02.9534845Z Downloading and Extracting Packages: ...working...
... (interleaved terminal progress bars elided; the transfer covers the 1.61 GB of packages listed above, largest first: nsight-compute 443.1 MB, libcublas 256.2 MB, libcufft 156.2 MB, libcusparse 118.6 MB, cuda-nsight 113.2 MB, cuda-nvvp 109.3 MB, libcusolver 95.8 MB, libnpp 93.4 MB; the section is truncated mid-download) ...
| 443.1 MB | #######1 | 72% 2025-05-07T20:26:12.5382232Z 2025-05-07T20:26:12.5382238Z 2025-05-07T20:26:12.5382245Z 2025-05-07T20:26:12.5382250Z 2025-05-07T20:26:12.5382255Z 2025-05-07T20:26:12.5382262Z 2025-05-07T20:26:12.5382271Z 2025-05-07T20:26:12.6161388Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 65%  2025-05-07T20:26:12.6387442Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:12.6387837Z 2025-05-07T20:26:12.6387843Z 2025-05-07T20:26:12.6387848Z 2025-05-07T20:26:12.6387854Z 2025-05-07T20:26:12.6387859Z 2025-05-07T20:26:12.6387865Z 2025-05-07T20:26:12.6390850Z 2025-05-07T20:26:12.7165926Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:26:12.7516177Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:26:12.7516582Z 2025-05-07T20:26:12.7516589Z 2025-05-07T20:26:12.7516594Z 2025-05-07T20:26:12.7516600Z 2025-05-07T20:26:12.7516605Z 2025-05-07T20:26:12.7516611Z 2025-05-07T20:26:12.7516617Z 2025-05-07T20:26:12.8167880Z libnpp-12.3.1.54 | 93.4 MB | #######3 | 73%  2025-05-07T20:26:12.8536304Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:12.8536600Z 2025-05-07T20:26:12.8536604Z 2025-05-07T20:26:12.8536607Z 2025-05-07T20:26:12.8536611Z 2025-05-07T20:26:12.8536615Z 2025-05-07T20:26:12.8536619Z 2025-05-07T20:26:12.8538024Z 2025-05-07T20:26:12.9223640Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:26:12.9692048Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:12.9692434Z 2025-05-07T20:26:12.9692439Z 2025-05-07T20:26:12.9692442Z 2025-05-07T20:26:12.9692446Z 2025-05-07T20:26:12.9692450Z 2025-05-07T20:26:12.9692454Z 2025-05-07T20:26:12.9693719Z 2025-05-07T20:26:13.0224442Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:26:13.0758778Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:26:13.0759063Z 2025-05-07T20:26:13.0759067Z 2025-05-07T20:26:13.0759070Z 2025-05-07T20:26:13.0759074Z 2025-05-07T20:26:13.0759078Z 2025-05-07T20:26:13.0759082Z 2025-05-07T20:26:13.0759086Z 2025-05-07T20:26:13.1226333Z libnpp-12.3.1.54 | 93.4 MB | ########4 | 84%  2025-05-07T20:26:13.1769860Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:13.1770281Z 2025-05-07T20:26:13.1770313Z 2025-05-07T20:26:13.1770319Z 2025-05-07T20:26:13.1770324Z 2025-05-07T20:26:13.1770329Z 2025-05-07T20:26:13.1770334Z 2025-05-07T20:26:13.1770382Z 2025-05-07T20:26:13.2247104Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:26:13.2774118Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:13.2774393Z 2025-05-07T20:26:13.2774397Z 2025-05-07T20:26:13.2774401Z 2025-05-07T20:26:13.2774405Z 2025-05-07T20:26:13.2774424Z 2025-05-07T20:26:13.2774429Z 2025-05-07T20:26:13.2775125Z 2025-05-07T20:26:13.3250415Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:26:13.3778566Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:26:13.3778879Z 2025-05-07T20:26:13.3778884Z 2025-05-07T20:26:13.3778889Z 2025-05-07T20:26:13.3778894Z 2025-05-07T20:26:13.3778898Z 2025-05-07T20:26:13.3778904Z 2025-05-07T20:26:13.3779474Z 2025-05-07T20:26:13.4357031Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 96%  2025-05-07T20:26:13.4779633Z nsight-compute-2024. 
| 443.1 MB | #######9 | 79% 2025-05-07T20:26:13.4779903Z 2025-05-07T20:26:13.4779908Z 2025-05-07T20:26:13.4779911Z 2025-05-07T20:26:13.4779915Z 2025-05-07T20:26:13.4779919Z 2025-05-07T20:26:13.4779923Z 2025-05-07T20:26:13.4782326Z 2025-05-07T20:26:13.5356975Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 100%  2025-05-07T20:26:13.6728766Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:13.8523103Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:26:13.9523263Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:26:14.0528367Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:26:14.1531679Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:14.2535994Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:26:14.3536108Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:26:14.4575253Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:14.5575671Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:14.6153183Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:26:14.6153463Z 2025-05-07T20:26:14.6153467Z 2025-05-07T20:26:14.6153471Z 2025-05-07T20:26:14.6153475Z 2025-05-07T20:26:14.6153480Z 2025-05-07T20:26:14.6153485Z 2025-05-07T20:26:14.6586065Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:14.6648453Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:14.6648729Z 2025-05-07T20:26:14.6649071Z 2025-05-07T20:26:14.6649075Z 2025-05-07T20:26:14.6649175Z 2025-05-07T20:26:14.6649218Z 2025-05-07T20:26:14.6649234Z 2025-05-07T20:26:14.6649240Z 2025-05-07T20:26:14.6649415Z 2025-05-07T20:26:14.7652795Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:14.7653228Z 2025-05-07T20:26:14.7653262Z 2025-05-07T20:26:14.7653513Z 2025-05-07T20:26:14.7653517Z 2025-05-07T20:26:14.7653533Z 2025-05-07T20:26:14.7653536Z 2025-05-07T20:26:14.7653540Z 2025-05-07T20:26:14.7657677Z 2025-05-07T20:26:14.7867763Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:26:14.8702539Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:14.8702874Z 2025-05-07T20:26:14.8703108Z 2025-05-07T20:26:14.8703118Z 2025-05-07T20:26:14.8703124Z 2025-05-07T20:26:14.8703129Z 2025-05-07T20:26:14.8703135Z 2025-05-07T20:26:14.8703141Z 2025-05-07T20:26:14.8709021Z 2025-05-07T20:26:14.9021868Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 12%  2025-05-07T20:26:14.9707903Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:14.9708325Z 2025-05-07T20:26:14.9708332Z 2025-05-07T20:26:14.9708337Z 2025-05-07T20:26:14.9708342Z 2025-05-07T20:26:14.9708348Z 2025-05-07T20:26:14.9708354Z 2025-05-07T20:26:14.9708359Z 2025-05-07T20:26:14.9714711Z 2025-05-07T20:26:15.0089681Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:26:15.0793370Z nsight-compute-2024. 
| 443.1 MB | #########1 | 91% 2025-05-07T20:26:15.0793816Z 2025-05-07T20:26:15.0793822Z 2025-05-07T20:26:15.0793827Z 2025-05-07T20:26:15.0793844Z 2025-05-07T20:26:15.0793849Z 2025-05-07T20:26:15.0793854Z 2025-05-07T20:26:15.0793860Z 2025-05-07T20:26:15.0798874Z 2025-05-07T20:26:15.0852932Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 25%  2025-05-07T20:26:15.0853343Z 2025-05-07T20:26:15.0853348Z 2025-05-07T20:26:15.0853353Z 2025-05-07T20:26:15.0853358Z 2025-05-07T20:26:15.0858165Z 2025-05-07T20:26:15.1387822Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:15.1388126Z 2025-05-07T20:26:15.1388130Z 2025-05-07T20:26:15.1388134Z 2025-05-07T20:26:15.1388138Z 2025-05-07T20:26:15.1388142Z 2025-05-07T20:26:15.1388146Z 2025-05-07T20:26:15.1388150Z 2025-05-07T20:26:15.1388398Z 2025-05-07T20:26:15.1388585Z 2025-05-07T20:26:15.1403155Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:15.1793413Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:15.1793707Z 2025-05-07T20:26:15.1793711Z 2025-05-07T20:26:15.1793715Z 2025-05-07T20:26:15.1793718Z 2025-05-07T20:26:15.1793722Z 2025-05-07T20:26:15.1793726Z 2025-05-07T20:26:15.1793729Z 2025-05-07T20:26:15.1793733Z 2025-05-07T20:26:15.2390682Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###2 | 33%  2025-05-07T20:26:15.2391016Z 2025-05-07T20:26:15.2391020Z 2025-05-07T20:26:15.2391024Z 2025-05-07T20:26:15.2391028Z 2025-05-07T20:26:15.2391033Z 2025-05-07T20:26:15.2391037Z 2025-05-07T20:26:15.2391042Z 2025-05-07T20:26:15.2391046Z 2025-05-07T20:26:15.2398835Z 2025-05-07T20:26:15.2408377Z libcurand-10.3.7.77 | 39.9 MB | 3 | 4%  2025-05-07T20:26:15.2797187Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:26:15.2797484Z 2025-05-07T20:26:15.2797488Z 2025-05-07T20:26:15.2797491Z 2025-05-07T20:26:15.2797495Z 2025-05-07T20:26:15.2797499Z 2025-05-07T20:26:15.2797503Z 2025-05-07T20:26:15.2797506Z 2025-05-07T20:26:15.2797510Z 2025-05-07T20:26:15.3404480Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###9 | 40%  2025-05-07T20:26:15.3404943Z 2025-05-07T20:26:15.3404952Z 2025-05-07T20:26:15.3404960Z 2025-05-07T20:26:15.3404968Z 2025-05-07T20:26:15.3404976Z 2025-05-07T20:26:15.3404984Z 2025-05-07T20:26:15.3404992Z 2025-05-07T20:26:15.3405000Z 2025-05-07T20:26:15.3407810Z 2025-05-07T20:26:15.3496186Z libcurand-10.3.7.77 | 39.9 MB | 9 | 10%  2025-05-07T20:26:15.3984278Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:26:15.3984555Z 2025-05-07T20:26:15.3984560Z 2025-05-07T20:26:15.3984565Z 2025-05-07T20:26:15.3984568Z 2025-05-07T20:26:15.3984573Z 2025-05-07T20:26:15.3984578Z 2025-05-07T20:26:15.3984891Z 2025-05-07T20:26:15.3984898Z 2025-05-07T20:26:15.4410918Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####6 | 46%  2025-05-07T20:26:15.4411236Z 2025-05-07T20:26:15.4411240Z 2025-05-07T20:26:15.4411244Z 2025-05-07T20:26:15.4411248Z 2025-05-07T20:26:15.4411251Z 2025-05-07T20:26:15.4411255Z 2025-05-07T20:26:15.4411259Z 2025-05-07T20:26:15.4411263Z 2025-05-07T20:26:15.4411273Z 2025-05-07T20:26:15.4615953Z libcurand-10.3.7.77 | 39.9 MB | #6 | 16%  2025-05-07T20:26:15.5142070Z nsight-compute-2024. 
| 443.1 MB | #########4 | 94% 2025-05-07T20:26:15.5142341Z 2025-05-07T20:26:15.5142621Z 2025-05-07T20:26:15.5142633Z 2025-05-07T20:26:15.5142687Z 2025-05-07T20:26:15.5142692Z 2025-05-07T20:26:15.5142697Z 2025-05-07T20:26:15.5142703Z 2025-05-07T20:26:15.5144209Z 2025-05-07T20:26:15.5411234Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####2 | 53%  2025-05-07T20:26:15.5411625Z 2025-05-07T20:26:15.5411652Z 2025-05-07T20:26:15.5411672Z 2025-05-07T20:26:15.5411677Z 2025-05-07T20:26:15.5411683Z 2025-05-07T20:26:15.5411696Z 2025-05-07T20:26:15.5411701Z 2025-05-07T20:26:15.5411705Z 2025-05-07T20:26:15.5413724Z 2025-05-07T20:26:15.5696103Z libcurand-10.3.7.77 | 39.9 MB | ##3 | 23%  2025-05-07T20:26:15.6238065Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:15.6238332Z 2025-05-07T20:26:15.6238337Z 2025-05-07T20:26:15.6238341Z 2025-05-07T20:26:15.6238347Z 2025-05-07T20:26:15.6238350Z 2025-05-07T20:26:15.6238354Z 2025-05-07T20:26:15.6238366Z 2025-05-07T20:26:15.6239873Z 2025-05-07T20:26:15.6416491Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####8 | 59%  2025-05-07T20:26:15.6416811Z 2025-05-07T20:26:15.6416816Z 2025-05-07T20:26:15.6416827Z 2025-05-07T20:26:15.6416831Z 2025-05-07T20:26:15.6416834Z 2025-05-07T20:26:15.6416839Z 2025-05-07T20:26:15.6416842Z 2025-05-07T20:26:15.6416847Z 2025-05-07T20:26:15.6420142Z 2025-05-07T20:26:15.6699766Z libcurand-10.3.7.77 | 39.9 MB | ##9 | 30%  2025-05-07T20:26:15.7245636Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:15.7245932Z 2025-05-07T20:26:15.7245936Z 2025-05-07T20:26:15.7245940Z 2025-05-07T20:26:15.7245944Z 2025-05-07T20:26:15.7245947Z 2025-05-07T20:26:15.7245951Z 2025-05-07T20:26:15.7245955Z 2025-05-07T20:26:15.7251826Z 2025-05-07T20:26:15.7422440Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######4 | 65%  2025-05-07T20:26:15.7422842Z 2025-05-07T20:26:15.7422848Z 2025-05-07T20:26:15.7422853Z 2025-05-07T20:26:15.7422858Z 2025-05-07T20:26:15.7422863Z 2025-05-07T20:26:15.7422868Z 2025-05-07T20:26:15.7422875Z 2025-05-07T20:26:15.7422881Z 2025-05-07T20:26:15.7424377Z 2025-05-07T20:26:15.7753840Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 36%  2025-05-07T20:26:15.8246927Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:26:15.8247333Z 2025-05-07T20:26:15.8247354Z 2025-05-07T20:26:15.8247359Z 2025-05-07T20:26:15.8247364Z 2025-05-07T20:26:15.8247369Z 2025-05-07T20:26:15.8247375Z 2025-05-07T20:26:15.8247380Z 2025-05-07T20:26:15.8250191Z 2025-05-07T20:26:15.8428072Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######1 | 71%  2025-05-07T20:26:15.8428522Z 2025-05-07T20:26:15.8428527Z 2025-05-07T20:26:15.8428532Z 2025-05-07T20:26:15.8428537Z 2025-05-07T20:26:15.8428543Z 2025-05-07T20:26:15.8428548Z 2025-05-07T20:26:15.8428564Z 2025-05-07T20:26:15.8428571Z 2025-05-07T20:26:15.8429917Z 2025-05-07T20:26:15.8757497Z libcurand-10.3.7.77 | 39.9 MB | ####3 | 43%  2025-05-07T20:26:15.9379979Z nsight-compute-2024. 
| 443.1 MB | #########7 | 97% 2025-05-07T20:26:15.9380433Z 2025-05-07T20:26:15.9380440Z 2025-05-07T20:26:15.9380446Z 2025-05-07T20:26:15.9380451Z 2025-05-07T20:26:15.9380458Z 2025-05-07T20:26:15.9380464Z 2025-05-07T20:26:15.9380470Z 2025-05-07T20:26:15.9385263Z 2025-05-07T20:26:15.9430124Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######7 | 77%  2025-05-07T20:26:15.9430513Z 2025-05-07T20:26:15.9430518Z 2025-05-07T20:26:15.9430521Z 2025-05-07T20:26:15.9430525Z 2025-05-07T20:26:15.9430529Z 2025-05-07T20:26:15.9430541Z 2025-05-07T20:26:15.9430545Z 2025-05-07T20:26:15.9430548Z 2025-05-07T20:26:15.9431573Z 2025-05-07T20:26:15.9776892Z libcurand-10.3.7.77 | 39.9 MB | ##### | 51%  2025-05-07T20:26:16.0389823Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:16.0390169Z 2025-05-07T20:26:16.0390176Z 2025-05-07T20:26:16.0390181Z 2025-05-07T20:26:16.0390187Z 2025-05-07T20:26:16.0390192Z 2025-05-07T20:26:16.0390209Z 2025-05-07T20:26:16.0390215Z 2025-05-07T20:26:16.0391763Z 2025-05-07T20:26:16.0432676Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########4 | 84%  2025-05-07T20:26:16.0433136Z 2025-05-07T20:26:16.0433141Z 2025-05-07T20:26:16.0433145Z 2025-05-07T20:26:16.0433167Z 2025-05-07T20:26:16.0433183Z 2025-05-07T20:26:16.0433186Z 2025-05-07T20:26:16.0433190Z 2025-05-07T20:26:16.0433193Z 2025-05-07T20:26:16.0433197Z 2025-05-07T20:26:16.0784085Z libcurand-10.3.7.77 | 39.9 MB | #####7 | 58%  2025-05-07T20:26:16.1434665Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:16.1434940Z 2025-05-07T20:26:16.1435333Z 2025-05-07T20:26:16.1435339Z 2025-05-07T20:26:16.1435357Z 2025-05-07T20:26:16.1435360Z 2025-05-07T20:26:16.1435364Z 2025-05-07T20:26:16.1435368Z 2025-05-07T20:26:16.1435371Z 2025-05-07T20:26:16.1436924Z 2025-05-07T20:26:16.1457284Z libcurand-10.3.7.77 | 39.9 MB | ######5 | 65%  2025-05-07T20:26:16.1457603Z 2025-05-07T20:26:16.1457608Z 2025-05-07T20:26:16.1457612Z 2025-05-07T20:26:16.1457616Z 2025-05-07T20:26:16.1457620Z 2025-05-07T20:26:16.1457625Z 2025-05-07T20:26:16.1457630Z 2025-05-07T20:26:16.1461957Z 2025-05-07T20:26:16.1848065Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######### | 90%  2025-05-07T20:26:16.2436758Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:26:16.2437096Z 2025-05-07T20:26:16.2437102Z 2025-05-07T20:26:16.2437119Z 2025-05-07T20:26:16.2437124Z 2025-05-07T20:26:16.2437129Z 2025-05-07T20:26:16.2437135Z 2025-05-07T20:26:16.2437140Z 2025-05-07T20:26:16.2437145Z 2025-05-07T20:26:16.2442205Z 2025-05-07T20:26:16.2484418Z libcurand-10.3.7.77 | 39.9 MB | #######2 | 73%  2025-05-07T20:26:16.2484874Z 2025-05-07T20:26:16.2484880Z 2025-05-07T20:26:16.2484885Z 2025-05-07T20:26:16.2484891Z 2025-05-07T20:26:16.2484897Z 2025-05-07T20:26:16.2484902Z 2025-05-07T20:26:16.2484908Z 2025-05-07T20:26:16.2484913Z 2025-05-07T20:26:16.3438484Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:16.3438858Z 2025-05-07T20:26:16.3438862Z 2025-05-07T20:26:16.3438866Z 2025-05-07T20:26:16.3438870Z 2025-05-07T20:26:16.3438899Z 2025-05-07T20:26:16.3438918Z 2025-05-07T20:26:16.3438921Z 2025-05-07T20:26:16.3438925Z 2025-05-07T20:26:16.3440931Z 2025-05-07T20:26:16.3723930Z libcurand-10.3.7.77 | 39.9 MB | ######## | 81%  2025-05-07T20:26:16.3724241Z 2025-05-07T20:26:16.3724245Z 2025-05-07T20:26:16.3724249Z 2025-05-07T20:26:16.3724252Z 2025-05-07T20:26:16.3724256Z 2025-05-07T20:26:16.3724260Z 2025-05-07T20:26:16.3726547Z 2025-05-07T20:26:16.4159886Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:16.4160180Z 2025-05-07T20:26:16.4160184Z 2025-05-07T20:26:16.4160188Z 2025-05-07T20:26:16.4160200Z 2025-05-07T20:26:16.4160204Z 2025-05-07T20:26:16.4160208Z 2025-05-07T20:26:16.4160211Z 2025-05-07T20:26:16.4160215Z 2025-05-07T20:26:16.4160219Z 2025-05-07T20:26:16.4160389Z 2025-05-07T20:26:16.4439256Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:16.4439615Z 2025-05-07T20:26:16.4439620Z 2025-05-07T20:26:16.4439905Z 2025-05-07T20:26:16.4439911Z 2025-05-07T20:26:16.4439916Z 2025-05-07T20:26:16.4439921Z 2025-05-07T20:26:16.4439926Z 2025-05-07T20:26:16.4439931Z 2025-05-07T20:26:16.4442685Z 2025-05-07T20:26:16.5167192Z libcurand-10.3.7.77 | 39.9 MB | ########9 | 90%  2025-05-07T20:26:16.5167521Z 2025-05-07T20:26:16.5167525Z 2025-05-07T20:26:16.5167529Z 2025-05-07T20:26:16.5167533Z 2025-05-07T20:26:16.5167537Z 2025-05-07T20:26:16.5167541Z 2025-05-07T20:26:16.5167544Z 2025-05-07T20:26:16.5167548Z 2025-05-07T20:26:16.5167552Z 2025-05-07T20:26:16.5167775Z 2025-05-07T20:26:16.5439495Z gds-tools-1.11.1.6 | 37.8 MB | 8 | 9%  2025-05-07T20:26:16.5439804Z 2025-05-07T20:26:16.5439809Z 2025-05-07T20:26:16.5439813Z 2025-05-07T20:26:16.5439818Z 2025-05-07T20:26:16.5439822Z 2025-05-07T20:26:16.5439826Z 2025-05-07T20:26:16.5439830Z 2025-05-07T20:26:16.5439833Z 2025-05-07T20:26:16.5443277Z 2025-05-07T20:26:16.6167970Z libcurand-10.3.7.77 | 39.9 MB | #########8 | 98%  2025-05-07T20:26:16.6168297Z 2025-05-07T20:26:16.6168301Z 2025-05-07T20:26:16.6168313Z 2025-05-07T20:26:16.6168317Z 2025-05-07T20:26:16.6168320Z 2025-05-07T20:26:16.6168324Z 2025-05-07T20:26:16.6168327Z 2025-05-07T20:26:16.6168331Z 2025-05-07T20:26:16.6168335Z 2025-05-07T20:26:16.6168737Z 2025-05-07T20:26:16.7168669Z gds-tools-1.11.1.6 | 37.8 MB | #8 | 19%  2025-05-07T20:26:16.7169111Z 2025-05-07T20:26:16.7169117Z 2025-05-07T20:26:16.7169122Z 2025-05-07T20:26:16.7169127Z 2025-05-07T20:26:16.7169133Z 2025-05-07T20:26:16.7169138Z 2025-05-07T20:26:16.7169143Z 2025-05-07T20:26:16.7169148Z 2025-05-07T20:26:16.7169154Z 2025-05-07T20:26:16.7171474Z 2025-05-07T20:26:16.8174375Z gds-tools-1.11.1.6 | 37.8 MB | ##9 | 29%  2025-05-07T20:26:16.8174780Z 2025-05-07T20:26:16.8174785Z 2025-05-07T20:26:16.8174790Z 2025-05-07T20:26:16.8175060Z 
2025-05-07T20:26:16.8175080Z 2025-05-07T20:26:16.8175084Z 2025-05-07T20:26:16.8175087Z 2025-05-07T20:26:16.8175091Z 2025-05-07T20:26:16.8175095Z 2025-05-07T20:26:16.8176809Z 2025-05-07T20:26:16.8316397Z gds-tools-1.11.1.6 | 37.8 MB | ###9 | 40%  2025-05-07T20:26:16.8316740Z 2025-05-07T20:26:16.8316744Z 2025-05-07T20:26:16.8316747Z 2025-05-07T20:26:16.9174997Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:16.9175388Z 2025-05-07T20:26:16.9175392Z 2025-05-07T20:26:16.9175396Z 2025-05-07T20:26:16.9175400Z 2025-05-07T20:26:16.9175404Z 2025-05-07T20:26:16.9175407Z 2025-05-07T20:26:16.9175411Z 2025-05-07T20:26:16.9175424Z 2025-05-07T20:26:16.9175427Z 2025-05-07T20:26:16.9175431Z 2025-05-07T20:26:17.0230982Z gds-tools-1.11.1.6 | 37.8 MB | ####9 | 50%  2025-05-07T20:26:17.0231300Z 2025-05-07T20:26:17.0231312Z 2025-05-07T20:26:17.0231315Z 2025-05-07T20:26:17.0231319Z 2025-05-07T20:26:17.0231344Z 2025-05-07T20:26:17.0231361Z 2025-05-07T20:26:17.0231364Z 2025-05-07T20:26:17.0231368Z 2025-05-07T20:26:17.0231371Z 2025-05-07T20:26:17.0231375Z 2025-05-07T20:26:17.1235000Z gds-tools-1.11.1.6 | 37.8 MB | ###### | 60%  2025-05-07T20:26:17.1235316Z 2025-05-07T20:26:17.1235320Z 2025-05-07T20:26:17.1235332Z 2025-05-07T20:26:17.1235336Z 2025-05-07T20:26:17.1235339Z 2025-05-07T20:26:17.1235343Z 2025-05-07T20:26:17.1235347Z 2025-05-07T20:26:17.1235350Z 2025-05-07T20:26:17.1235354Z 2025-05-07T20:26:17.1235357Z 2025-05-07T20:26:17.2274379Z gds-tools-1.11.1.6 | 37.8 MB | ####### | 71%  2025-05-07T20:26:17.2274680Z 2025-05-07T20:26:17.2274695Z 2025-05-07T20:26:17.2274699Z 2025-05-07T20:26:17.2274702Z 2025-05-07T20:26:17.2274706Z 2025-05-07T20:26:17.2274710Z 2025-05-07T20:26:17.2274714Z 2025-05-07T20:26:17.2274718Z 2025-05-07T20:26:17.2274722Z 2025-05-07T20:26:17.2276105Z 2025-05-07T20:26:17.3795950Z gds-tools-1.11.1.6 | 37.8 MB | ######## | 81%  2025-05-07T20:26:17.3796876Z 2025-05-07T20:26:17.4190352Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%  2025-05-07T20:26:17.4190653Z 2025-05-07T20:26:17.4190657Z 2025-05-07T20:26:17.4190661Z 2025-05-07T20:26:17.4190665Z 2025-05-07T20:26:17.4190669Z 2025-05-07T20:26:17.4190672Z 2025-05-07T20:26:17.4190676Z 2025-05-07T20:26:17.4190680Z 2025-05-07T20:26:17.4190684Z 2025-05-07T20:26:17.4190692Z 2025-05-07T20:26:17.4305541Z gds-tools-1.11.1.6 | 37.8 MB | #########1 | 91%  2025-05-07T20:26:17.4305978Z 2025-05-07T20:26:17.4305984Z 2025-05-07T20:26:17.4305989Z 2025-05-07T20:26:17.4305994Z 2025-05-07T20:26:17.4305999Z 2025-05-07T20:26:17.4306005Z 2025-05-07T20:26:17.4306010Z 2025-05-07T20:26:17.4306016Z 2025-05-07T20:26:17.4306021Z 2025-05-07T20:26:17.4306026Z 2025-05-07T20:26:17.4309796Z 2025-05-07T20:26:17.5197266Z cuda-nvcc-tools-12.6 | 23.0 MB | | 0%  2025-05-07T20:26:17.5197729Z 2025-05-07T20:26:17.5197733Z 2025-05-07T20:26:17.5197737Z 2025-05-07T20:26:17.5197741Z 2025-05-07T20:26:17.5197744Z 2025-05-07T20:26:17.5197748Z 2025-05-07T20:26:17.5197752Z 2025-05-07T20:26:17.5197755Z 2025-05-07T20:26:17.5197759Z 2025-05-07T20:26:17.5203612Z 2025-05-07T20:26:17.5308956Z gds-tools-1.11.1.6 | 37.8 MB | #########9 | 100%  2025-05-07T20:26:17.5309388Z 2025-05-07T20:26:17.5309392Z 2025-05-07T20:26:17.5309396Z 2025-05-07T20:26:17.5309400Z 2025-05-07T20:26:17.5309403Z 2025-05-07T20:26:17.5309407Z 2025-05-07T20:26:17.5309411Z 2025-05-07T20:26:17.5309414Z 2025-05-07T20:26:17.5309418Z 2025-05-07T20:26:17.5309422Z 2025-05-07T20:26:17.5309426Z 2025-05-07T20:26:17.6311817Z cuda-nvcc-tools-12.6 | 23.0 MB | #4 | 15%  2025-05-07T20:26:17.6312156Z 2025-05-07T20:26:17.6312160Z 
2025-05-07T20:26:17.6312164Z 2025-05-07T20:26:17.6312168Z 2025-05-07T20:26:17.6312443Z 2025-05-07T20:26:17.6312459Z 2025-05-07T20:26:17.6312463Z 2025-05-07T20:26:17.6312468Z 2025-05-07T20:26:17.6312471Z 2025-05-07T20:26:17.6312475Z 2025-05-07T20:26:17.6312530Z 2025-05-07T20:26:17.7338053Z cuda-nvcc-tools-12.6 | 23.0 MB | ###2 | 33%  2025-05-07T20:26:17.7338398Z 2025-05-07T20:26:17.7338402Z 2025-05-07T20:26:17.7338415Z 2025-05-07T20:26:17.7338419Z 2025-05-07T20:26:17.7338423Z 2025-05-07T20:26:17.7338426Z 2025-05-07T20:26:17.7338430Z 2025-05-07T20:26:17.7338434Z 2025-05-07T20:26:17.7338438Z 2025-05-07T20:26:17.7338442Z 2025-05-07T20:26:17.7339002Z 2025-05-07T20:26:17.8339345Z cuda-nvcc-tools-12.6 | 23.0 MB | ####9 | 50%  2025-05-07T20:26:17.8339793Z 2025-05-07T20:26:17.8339799Z 2025-05-07T20:26:17.8339804Z 2025-05-07T20:26:17.8339809Z 2025-05-07T20:26:17.8339815Z 2025-05-07T20:26:17.8339822Z 2025-05-07T20:26:17.8339829Z 2025-05-07T20:26:17.8339833Z 2025-05-07T20:26:17.8339866Z 2025-05-07T20:26:17.8339885Z 2025-05-07T20:26:17.8339890Z 2025-05-07T20:26:17.8661125Z cuda-nvcc-tools-12.6 | 23.0 MB | ######7 | 67%  2025-05-07T20:26:17.8661463Z 2025-05-07T20:26:17.8661467Z 2025-05-07T20:26:17.8661471Z 2025-05-07T20:26:17.8661475Z 2025-05-07T20:26:17.8661479Z 2025-05-07T20:26:17.8661482Z 2025-05-07T20:26:17.8661486Z 2025-05-07T20:26:17.8661490Z 2025-05-07T20:26:17.8662943Z 2025-05-07T20:26:17.8912891Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:17.8913207Z 2025-05-07T20:26:17.8913211Z 2025-05-07T20:26:17.8913215Z 2025-05-07T20:26:17.8913219Z 2025-05-07T20:26:17.8913222Z 2025-05-07T20:26:17.8913226Z 2025-05-07T20:26:17.8913239Z 2025-05-07T20:26:17.8913242Z 2025-05-07T20:26:17.9315941Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%  2025-05-07T20:26:17.9316384Z 2025-05-07T20:26:17.9316390Z 2025-05-07T20:26:17.9316407Z 2025-05-07T20:26:17.9316442Z 2025-05-07T20:26:17.9316696Z 2025-05-07T20:26:17.9316700Z 2025-05-07T20:26:17.9316703Z 2025-05-07T20:26:17.9316707Z 2025-05-07T20:26:17.9316711Z 2025-05-07T20:26:17.9316714Z 2025-05-07T20:26:17.9316718Z 2025-05-07T20:26:17.9320140Z 2025-05-07T20:26:17.9338939Z cuda-nvrtc-12.6.85 | 17.3 MB | | 0%  2025-05-07T20:26:17.9339367Z 2025-05-07T20:26:17.9339371Z 2025-05-07T20:26:17.9339375Z 2025-05-07T20:26:17.9339379Z 2025-05-07T20:26:17.9339383Z 2025-05-07T20:26:17.9339386Z 2025-05-07T20:26:17.9339390Z 2025-05-07T20:26:17.9339394Z 2025-05-07T20:26:17.9339398Z 2025-05-07T20:26:17.9339402Z 2025-05-07T20:26:17.9340100Z 2025-05-07T20:26:17.9517025Z cuda-nvcc-tools-12.6 | 23.0 MB | ########7 | 87%  2025-05-07T20:26:17.9517351Z 2025-05-07T20:26:17.9517355Z 2025-05-07T20:26:17.9517359Z 2025-05-07T20:26:17.9517363Z 2025-05-07T20:26:17.9517367Z 2025-05-07T20:26:17.9517370Z 2025-05-07T20:26:17.9517393Z 2025-05-07T20:26:17.9517404Z 2025-05-07T20:26:17.9517408Z 2025-05-07T20:26:17.9517418Z 2025-05-07T20:26:17.9517422Z 2025-05-07T20:26:17.9517425Z 2025-05-07T20:26:17.9520545Z 2025-05-07T20:26:18.0317518Z libnvjitlink-12.6.85 | 14.9 MB | | 0%  2025-05-07T20:26:18.0317960Z 2025-05-07T20:26:18.0317966Z 2025-05-07T20:26:18.0317971Z 2025-05-07T20:26:18.0317976Z 2025-05-07T20:26:18.0317991Z 2025-05-07T20:26:18.0317996Z 2025-05-07T20:26:18.0318001Z 2025-05-07T20:26:18.0318006Z 2025-05-07T20:26:18.0318011Z 2025-05-07T20:26:18.0318016Z 2025-05-07T20:26:18.0318022Z 2025-05-07T20:26:18.0319642Z 2025-05-07T20:26:18.0520469Z cuda-nvrtc-12.6.85 | 17.3 MB | #8 | 19%  2025-05-07T20:26:18.0520873Z 2025-05-07T20:26:18.0520879Z 2025-05-07T20:26:18.0520884Z 
2025-05-07T20:26:18.0520889Z 2025-05-07T20:26:18.0520895Z 2025-05-07T20:26:18.0520900Z 2025-05-07T20:26:18.0520905Z 2025-05-07T20:26:18.0520913Z 2025-05-07T20:26:18.0521193Z 2025-05-07T20:26:18.0521201Z 2025-05-07T20:26:18.0521206Z 2025-05-07T20:26:18.0521211Z 2025-05-07T20:26:18.0523126Z 2025-05-07T20:26:18.1318232Z libnvjitlink-12.6.85 | 14.9 MB | #9 | 19%  2025-05-07T20:26:18.1318609Z 2025-05-07T20:26:18.1318617Z 2025-05-07T20:26:18.1318622Z 2025-05-07T20:26:18.1318628Z 2025-05-07T20:26:18.1318635Z 2025-05-07T20:26:18.1318641Z 2025-05-07T20:26:18.1318647Z 2025-05-07T20:26:18.1318653Z 2025-05-07T20:26:18.1318658Z 2025-05-07T20:26:18.1318663Z 2025-05-07T20:26:18.1318668Z 2025-05-07T20:26:18.1321534Z 2025-05-07T20:26:18.1521182Z cuda-nvrtc-12.6.85 | 17.3 MB | ###8 | 39%  2025-05-07T20:26:18.1521649Z 2025-05-07T20:26:18.1521655Z 2025-05-07T20:26:18.1521660Z 2025-05-07T20:26:18.1521666Z 2025-05-07T20:26:18.1521671Z 2025-05-07T20:26:18.1521676Z 2025-05-07T20:26:18.1521680Z 2025-05-07T20:26:18.1521684Z 2025-05-07T20:26:18.1521688Z 2025-05-07T20:26:18.1521745Z 2025-05-07T20:26:18.1521750Z 2025-05-07T20:26:18.1521756Z 2025-05-07T20:26:18.1521762Z 2025-05-07T20:26:18.2320214Z libnvjitlink-12.6.85 | 14.9 MB | ####1 | 41%  2025-05-07T20:26:18.2320574Z 2025-05-07T20:26:18.2320578Z 2025-05-07T20:26:18.2320582Z 2025-05-07T20:26:18.2320585Z 2025-05-07T20:26:18.2320589Z 2025-05-07T20:26:18.2320593Z 2025-05-07T20:26:18.2320597Z 2025-05-07T20:26:18.2320601Z 2025-05-07T20:26:18.2320605Z 2025-05-07T20:26:18.2320609Z 2025-05-07T20:26:18.2320613Z 2025-05-07T20:26:18.2322117Z 2025-05-07T20:26:18.2559917Z cuda-nvrtc-12.6.85 | 17.3 MB | #####9 | 60%  2025-05-07T20:26:18.2560242Z 2025-05-07T20:26:18.2560248Z 2025-05-07T20:26:18.2560254Z 2025-05-07T20:26:18.2560259Z 2025-05-07T20:26:18.2560264Z 2025-05-07T20:26:18.2560270Z 2025-05-07T20:26:18.2560274Z 2025-05-07T20:26:18.2560279Z 2025-05-07T20:26:18.2560284Z 2025-05-07T20:26:18.2560288Z 2025-05-07T20:26:18.2560326Z 2025-05-07T20:26:18.2560622Z 2025-05-07T20:26:18.2560645Z 2025-05-07T20:26:18.2590902Z libnvjitlink-12.6.85 | 14.9 MB | ######2 | 62%  2025-05-07T20:26:18.2591219Z 2025-05-07T20:26:18.2591224Z 2025-05-07T20:26:18.3425044Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:18.3425311Z 2025-05-07T20:26:18.3425547Z 2025-05-07T20:26:18.3425559Z 2025-05-07T20:26:18.3429480Z 2025-05-07T20:26:18.3429486Z 2025-05-07T20:26:18.3429492Z 2025-05-07T20:26:18.3429498Z 2025-05-07T20:26:18.3429504Z 2025-05-07T20:26:18.3429554Z 2025-05-07T20:26:18.3429560Z 2025-05-07T20:26:18.3429566Z 2025-05-07T20:26:18.3429571Z 2025-05-07T20:26:18.3600729Z cuda-nvrtc-12.6.85 | 17.3 MB | ######## | 80%  2025-05-07T20:26:18.3601105Z 2025-05-07T20:26:18.3601111Z 2025-05-07T20:26:18.3601117Z 2025-05-07T20:26:18.3601124Z 2025-05-07T20:26:18.3601130Z 2025-05-07T20:26:18.3601135Z 2025-05-07T20:26:18.3601178Z 2025-05-07T20:26:18.3601206Z 2025-05-07T20:26:18.3601211Z 2025-05-07T20:26:18.3601217Z 2025-05-07T20:26:18.3601223Z 2025-05-07T20:26:18.3601227Z 2025-05-07T20:26:18.3601232Z 2025-05-07T20:26:18.4443786Z libnvjitlink-12.6.85 | 14.9 MB | ########2 | 83%  2025-05-07T20:26:18.4444147Z 2025-05-07T20:26:18.4444151Z 2025-05-07T20:26:18.4444156Z 2025-05-07T20:26:18.4444159Z 2025-05-07T20:26:18.4444163Z 2025-05-07T20:26:18.4444168Z 2025-05-07T20:26:18.4444171Z 2025-05-07T20:26:18.4444175Z 2025-05-07T20:26:18.4444179Z 2025-05-07T20:26:18.4444182Z 2025-05-07T20:26:18.4444186Z 2025-05-07T20:26:18.4444419Z 2025-05-07T20:26:18.7513975Z cuda-nvrtc-12.6.85 | 17.3 MB | #########9 | 100%  
2025-05-07T20:26:18.7514412Z 2025-05-07T20:26:18.7514418Z 2025-05-07T20:26:18.7514423Z 2025-05-07T20:26:18.7514428Z 2025-05-07T20:26:18.7514433Z 2025-05-07T20:26:18.7514450Z 2025-05-07T20:26:18.7514457Z 2025-05-07T20:26:18.7514774Z 2025-05-07T20:26:18.7514796Z 2025-05-07T20:26:18.7514802Z 2025-05-07T20:26:18.7517555Z 2025-05-07T20:26:18.7942126Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%  2025-05-07T20:26:18.7942463Z 2025-05-07T20:26:18.7942468Z 2025-05-07T20:26:18.7942474Z 2025-05-07T20:26:18.7942480Z 2025-05-07T20:26:18.7942485Z 2025-05-07T20:26:18.7942495Z 2025-05-07T20:26:18.7942503Z 2025-05-07T20:26:18.7942512Z 2025-05-07T20:26:18.7942520Z 2025-05-07T20:26:18.7942528Z 2025-05-07T20:26:18.8063053Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%  2025-05-07T20:26:18.8063377Z 2025-05-07T20:26:18.8063383Z 2025-05-07T20:26:18.8063389Z 2025-05-07T20:26:18.8063394Z 2025-05-07T20:26:18.8063408Z 2025-05-07T20:26:18.8063412Z 2025-05-07T20:26:18.8063418Z 2025-05-07T20:26:18.8063423Z 2025-05-07T20:26:18.8063428Z 2025-05-07T20:26:18.8063434Z 2025-05-07T20:26:18.8063438Z 2025-05-07T20:26:18.8063443Z 2025-05-07T20:26:18.8063475Z 2025-05-07T20:26:18.8063494Z 2025-05-07T20:26:18.8430527Z cuda-nvcc-dev_linux- | 10.8 MB | | 0%  2025-05-07T20:26:18.8430990Z 2025-05-07T20:26:18.8430997Z 2025-05-07T20:26:18.8431002Z 2025-05-07T20:26:18.8431008Z 2025-05-07T20:26:18.8431014Z 2025-05-07T20:26:18.8431019Z 2025-05-07T20:26:18.8431025Z 2025-05-07T20:26:18.8431031Z 2025-05-07T20:26:18.8431036Z 2025-05-07T20:26:18.8431042Z 2025-05-07T20:26:18.8431047Z 2025-05-07T20:26:18.8431052Z 2025-05-07T20:26:18.8431057Z 2025-05-07T20:26:18.8431060Z 2025-05-07T20:26:18.8436572Z 2025-05-07T20:26:18.8969699Z cuda-nvvm-tools-12.6 | 10.4 MB | | 0%  2025-05-07T20:26:18.8970035Z 2025-05-07T20:26:18.8970039Z 2025-05-07T20:26:18.8970043Z 2025-05-07T20:26:18.8970047Z 2025-05-07T20:26:18.8970051Z 2025-05-07T20:26:18.8970055Z 2025-05-07T20:26:18.8970058Z 2025-05-07T20:26:18.8970062Z 2025-05-07T20:26:18.8970066Z 2025-05-07T20:26:18.8970103Z 2025-05-07T20:26:18.8970641Z 2025-05-07T20:26:18.8970645Z 2025-05-07T20:26:18.8973614Z 2025-05-07T20:26:18.9068649Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%  2025-05-07T20:26:18.9069130Z 2025-05-07T20:26:18.9069134Z 2025-05-07T20:26:18.9069138Z 2025-05-07T20:26:18.9069141Z 2025-05-07T20:26:18.9069145Z 2025-05-07T20:26:18.9069149Z 2025-05-07T20:26:18.9069153Z 2025-05-07T20:26:18.9069156Z 2025-05-07T20:26:18.9069160Z 2025-05-07T20:26:18.9069164Z 2025-05-07T20:26:18.9069167Z 2025-05-07T20:26:18.9069171Z 2025-05-07T20:26:18.9069175Z 2025-05-07T20:26:18.9069179Z 2025-05-07T20:26:18.9308503Z cuda-nvcc-dev_linux- | 10.8 MB | ##6 | 26%  2025-05-07T20:26:18.9308845Z 2025-05-07T20:26:18.9308849Z 2025-05-07T20:26:18.9308853Z 2025-05-07T20:26:18.9308856Z 2025-05-07T20:26:18.9308860Z 2025-05-07T20:26:18.9308864Z 2025-05-07T20:26:18.9308868Z 2025-05-07T20:26:18.9308871Z 2025-05-07T20:26:18.9308900Z 2025-05-07T20:26:18.9308915Z 2025-05-07T20:26:18.9308918Z 2025-05-07T20:26:18.9308922Z 2025-05-07T20:26:18.9308925Z 2025-05-07T20:26:18.9308935Z 2025-05-07T20:26:18.9308939Z 2025-05-07T20:26:18.9309778Z 2025-05-07T20:26:18.9431340Z cuda-sanitizer-api-1 | 8.9 MB | | 0%  2025-05-07T20:26:18.9431735Z 2025-05-07T20:26:18.9431741Z 2025-05-07T20:26:18.9431746Z 2025-05-07T20:26:18.9431751Z 2025-05-07T20:26:18.9431756Z 2025-05-07T20:26:18.9431761Z 2025-05-07T20:26:18.9431766Z 2025-05-07T20:26:18.9431772Z 2025-05-07T20:26:18.9431777Z 2025-05-07T20:26:18.9431782Z 2025-05-07T20:26:18.9431787Z 
2025-05-07T20:26:18.9431792Z 2025-05-07T20:26:18.9431797Z 2025-05-07T20:26:18.9431802Z 2025-05-07T20:26:18.9434641Z 2025-05-07T20:26:18.9960739Z cuda-nvvm-tools-12.6 | 10.4 MB | ### | 31%  2025-05-07T20:26:18.9961083Z 2025-05-07T20:26:18.9961087Z 2025-05-07T20:26:18.9961091Z 2025-05-07T20:26:18.9961384Z 2025-05-07T20:26:18.9961389Z 2025-05-07T20:26:18.9961393Z 2025-05-07T20:26:18.9961396Z 2025-05-07T20:26:18.9961400Z 2025-05-07T20:26:18.9961404Z 2025-05-07T20:26:18.9961407Z 2025-05-07T20:26:18.9961411Z 2025-05-07T20:26:18.9962756Z 2025-05-07T20:26:19.0167629Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%  2025-05-07T20:26:19.0167947Z 2025-05-07T20:26:19.0167951Z 2025-05-07T20:26:19.0167955Z 2025-05-07T20:26:19.0167967Z 2025-05-07T20:26:19.0167971Z 2025-05-07T20:26:19.0167975Z 2025-05-07T20:26:19.0167978Z 2025-05-07T20:26:19.0167982Z 2025-05-07T20:26:19.0167985Z 2025-05-07T20:26:19.0167989Z 2025-05-07T20:26:19.0167993Z 2025-05-07T20:26:19.0167996Z 2025-05-07T20:26:19.0168000Z 2025-05-07T20:26:19.0168004Z 2025-05-07T20:26:19.0310011Z cuda-nvcc-dev_linux- | 10.8 MB | #####2 | 52%  2025-05-07T20:26:19.0310348Z 2025-05-07T20:26:19.0310352Z 2025-05-07T20:26:19.0310355Z 2025-05-07T20:26:19.0310384Z 2025-05-07T20:26:19.0310388Z 2025-05-07T20:26:19.0310391Z 2025-05-07T20:26:19.0310395Z 2025-05-07T20:26:19.0310399Z 2025-05-07T20:26:19.0310402Z 2025-05-07T20:26:19.0310406Z 2025-05-07T20:26:19.0310409Z 2025-05-07T20:26:19.0310413Z 2025-05-07T20:26:19.0310417Z 2025-05-07T20:26:19.0310420Z 2025-05-07T20:26:19.0310424Z 2025-05-07T20:26:19.0310435Z 2025-05-07T20:26:19.0583285Z cuda-sanitizer-api-1 | 8.9 MB | ##9 | 29%  2025-05-07T20:26:19.0583631Z 2025-05-07T20:26:19.0583635Z 2025-05-07T20:26:19.0583639Z 2025-05-07T20:26:19.0583643Z 2025-05-07T20:26:19.0583646Z 2025-05-07T20:26:19.0583650Z 2025-05-07T20:26:19.0583654Z 2025-05-07T20:26:19.0583658Z 2025-05-07T20:26:19.0583661Z 2025-05-07T20:26:19.0583665Z 2025-05-07T20:26:19.0583669Z 2025-05-07T20:26:19.0583673Z 2025-05-07T20:26:19.0583676Z 2025-05-07T20:26:19.0583686Z 2025-05-07T20:26:19.0583690Z 2025-05-07T20:26:19.0657323Z cuda-nvvm-tools-12.6 | 10.4 MB | ###### | 61%  2025-05-07T20:26:19.0657902Z 2025-05-07T20:26:19.0657907Z 2025-05-07T20:26:19.0657918Z 2025-05-07T20:26:19.0657922Z 2025-05-07T20:26:19.0657925Z 2025-05-07T20:26:19.0657929Z 2025-05-07T20:26:19.0657932Z 2025-05-07T20:26:19.0657936Z 2025-05-07T20:26:19.0657940Z 2025-05-07T20:26:19.0657943Z 2025-05-07T20:26:19.0657947Z 2025-05-07T20:26:19.0657951Z 2025-05-07T20:26:19.0657954Z 2025-05-07T20:26:19.0657958Z 2025-05-07T20:26:19.0657962Z 2025-05-07T20:26:19.0657965Z 2025-05-07T20:26:19.0659114Z 2025-05-07T20:26:19.1316204Z cuda-nvvm-impl-12.6. 
| 7.7 MB | | 0%  2025-05-07T20:26:19.1316553Z 2025-05-07T20:26:19.1316558Z 2025-05-07T20:26:19.1316563Z 2025-05-07T20:26:19.1316568Z 2025-05-07T20:26:19.1316589Z 2025-05-07T20:26:19.1316595Z 2025-05-07T20:26:19.1316601Z 2025-05-07T20:26:19.1316605Z 2025-05-07T20:26:19.1316609Z 2025-05-07T20:26:19.1316614Z 2025-05-07T20:26:19.1316646Z 2025-05-07T20:26:19.1316662Z 2025-05-07T20:26:19.1316666Z 2025-05-07T20:26:19.1316670Z 2025-05-07T20:26:19.1316674Z 2025-05-07T20:26:19.1316678Z 2025-05-07T20:26:19.1334681Z cuda-sanitizer-api-1 | 8.9 MB | #####9 | 59%  2025-05-07T20:26:19.1335030Z 2025-05-07T20:26:19.1335034Z 2025-05-07T20:26:19.1335038Z 2025-05-07T20:26:19.1335041Z 2025-05-07T20:26:19.1335046Z 2025-05-07T20:26:19.1335052Z 2025-05-07T20:26:19.1335057Z 2025-05-07T20:26:19.1335061Z 2025-05-07T20:26:19.1335065Z 2025-05-07T20:26:19.1335068Z 2025-05-07T20:26:19.1335072Z 2025-05-07T20:26:19.1335076Z 2025-05-07T20:26:19.1335079Z 2025-05-07T20:26:19.1335083Z 2025-05-07T20:26:19.1657819Z cuda-nvcc-dev_linux- | 10.8 MB | #######6 | 77%  2025-05-07T20:26:19.1658267Z 2025-05-07T20:26:19.1658271Z 2025-05-07T20:26:19.1658275Z 2025-05-07T20:26:19.1658279Z 2025-05-07T20:26:19.1658282Z 2025-05-07T20:26:19.1658516Z 2025-05-07T20:26:19.1658533Z 2025-05-07T20:26:19.1658536Z 2025-05-07T20:26:19.1658540Z 2025-05-07T20:26:19.1658543Z 2025-05-07T20:26:19.1658547Z 2025-05-07T20:26:19.1658551Z 2025-05-07T20:26:19.1658554Z 2025-05-07T20:26:19.1658558Z 2025-05-07T20:26:19.1658561Z 2025-05-07T20:26:19.1658565Z 2025-05-07T20:26:19.1663577Z 2025-05-07T20:26:19.1922384Z cuda-nvvm-impl-12.6. | 7.7 MB | ###4 | 35%  2025-05-07T20:26:19.1922784Z 2025-05-07T20:26:19.1922788Z 2025-05-07T20:26:19.1922792Z 2025-05-07T20:26:19.1922795Z 2025-05-07T20:26:19.1922799Z 2025-05-07T20:26:19.1922803Z 2025-05-07T20:26:19.1922806Z 2025-05-07T20:26:19.1922810Z 2025-05-07T20:26:19.1922814Z 2025-05-07T20:26:19.1922817Z 2025-05-07T20:26:19.1922821Z 2025-05-07T20:26:19.1922825Z 2025-05-07T20:26:19.1922828Z 2025-05-07T20:26:19.1922832Z 2025-05-07T20:26:19.1922836Z 2025-05-07T20:26:19.2341696Z cuda-nvvm-tools-12.6 | 10.4 MB | ########9 | 89%  2025-05-07T20:26:19.2342032Z 2025-05-07T20:26:19.2342036Z 2025-05-07T20:26:19.2342039Z 2025-05-07T20:26:19.2342043Z 2025-05-07T20:26:19.2342055Z 2025-05-07T20:26:19.2342059Z 2025-05-07T20:26:19.2342063Z 2025-05-07T20:26:19.2342066Z 2025-05-07T20:26:19.2342070Z 2025-05-07T20:26:19.2342074Z 2025-05-07T20:26:19.2342077Z 2025-05-07T20:26:19.2342081Z 2025-05-07T20:26:19.2342085Z 2025-05-07T20:26:19.2342088Z 2025-05-07T20:26:19.2342092Z 2025-05-07T20:26:19.2344136Z 2025-05-07T20:26:19.2374998Z cuda-sanitizer-api-1 | 8.9 MB | ########8 | 89%  2025-05-07T20:26:19.2375345Z 2025-05-07T20:26:19.2375349Z 2025-05-07T20:26:19.2375352Z 2025-05-07T20:26:19.2375356Z 2025-05-07T20:26:19.2375360Z 2025-05-07T20:26:19.2375363Z 2025-05-07T20:26:19.2375367Z 2025-05-07T20:26:19.2375370Z 2025-05-07T20:26:19.2375374Z 2025-05-07T20:26:19.2375378Z 2025-05-07T20:26:19.2375381Z 2025-05-07T20:26:19.2375385Z 2025-05-07T20:26:19.2375397Z 2025-05-07T20:26:19.2375613Z 2025-05-07T20:26:19.2699362Z cuda-nvcc-dev_linux- | 10.8 MB | #########9 | 100%  2025-05-07T20:26:19.2699807Z 2025-05-07T20:26:19.2699813Z 2025-05-07T20:26:19.2699818Z 2025-05-07T20:26:19.2699833Z 2025-05-07T20:26:19.2699838Z 2025-05-07T20:26:19.2699843Z 2025-05-07T20:26:19.2699848Z 2025-05-07T20:26:19.2699853Z 2025-05-07T20:26:19.2699859Z 2025-05-07T20:26:19.2699864Z 2025-05-07T20:26:19.2699869Z 2025-05-07T20:26:19.2699874Z 2025-05-07T20:26:19.2699879Z 2025-05-07T20:26:19.2699885Z 
2025-05-07T20:26:19.2699889Z 2025-05-07T20:26:19.2699892Z 2025-05-07T20:26:19.2699896Z 2025-05-07T20:26:19.5920279Z cuda-nvvm-impl-12.6. | 7.7 MB | ######9 | 69%  2025-05-07T20:26:19.5920635Z 2025-05-07T20:26:19.5920639Z 2025-05-07T20:26:19.5920644Z 2025-05-07T20:26:19.5920647Z 2025-05-07T20:26:19.5920652Z 2025-05-07T20:26:19.5920657Z 2025-05-07T20:26:19.5920661Z 2025-05-07T20:26:19.5920686Z 2025-05-07T20:26:19.5920704Z 2025-05-07T20:26:19.5920707Z 2025-05-07T20:26:19.5920711Z 2025-05-07T20:26:19.5920715Z 2025-05-07T20:26:19.5920718Z 2025-05-07T20:26:19.5920722Z 2025-05-07T20:26:19.5920726Z 2025-05-07T20:26:19.5921396Z 2025-05-07T20:26:19.6118362Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%  2025-05-07T20:26:19.6118693Z 2025-05-07T20:26:19.6118697Z 2025-05-07T20:26:19.6118701Z 2025-05-07T20:26:19.6118705Z 2025-05-07T20:26:19.6118708Z 2025-05-07T20:26:19.6118712Z 2025-05-07T20:26:19.6118716Z 2025-05-07T20:26:19.6118719Z 2025-05-07T20:26:19.6118723Z 2025-05-07T20:26:19.6118727Z 2025-05-07T20:26:19.6118730Z 2025-05-07T20:26:19.6118741Z 2025-05-07T20:26:19.6118745Z 2025-05-07T20:26:19.6118749Z 2025-05-07T20:26:19.6120275Z 2025-05-07T20:26:19.6230029Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%  2025-05-07T20:26:19.6230358Z 2025-05-07T20:26:19.6230572Z 2025-05-07T20:26:19.6230586Z 2025-05-07T20:26:19.6230589Z 2025-05-07T20:26:19.6230593Z 2025-05-07T20:26:19.6230597Z 2025-05-07T20:26:19.6230600Z 2025-05-07T20:26:19.6230604Z 2025-05-07T20:26:19.6230608Z 2025-05-07T20:26:19.6230611Z 2025-05-07T20:26:19.6230615Z 2025-05-07T20:26:19.6230619Z 2025-05-07T20:26:19.6230622Z 2025-05-07T20:26:19.6230626Z 2025-05-07T20:26:19.6230629Z 2025-05-07T20:26:19.6230633Z 2025-05-07T20:26:19.6230637Z 2025-05-07T20:26:19.6301450Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%  2025-05-07T20:26:19.6301858Z 2025-05-07T20:26:19.6301863Z 2025-05-07T20:26:19.6301866Z 2025-05-07T20:26:19.6301870Z 2025-05-07T20:26:19.6301874Z 2025-05-07T20:26:19.6301877Z 2025-05-07T20:26:19.6301881Z 2025-05-07T20:26:19.6301884Z 2025-05-07T20:26:19.6301895Z 2025-05-07T20:26:19.6301898Z 2025-05-07T20:26:19.6301902Z 2025-05-07T20:26:19.6301906Z 2025-05-07T20:26:19.6301909Z 2025-05-07T20:26:19.6301913Z 2025-05-07T20:26:19.6301926Z 2025-05-07T20:26:19.6301934Z 2025-05-07T20:26:19.6301938Z 2025-05-07T20:26:19.6302484Z 2025-05-07T20:26:19.6452211Z libglib-2.84.0 | 3.8 MB | | 0%  2025-05-07T20:26:19.6452739Z 2025-05-07T20:26:19.6452750Z 2025-05-07T20:26:19.6452760Z 2025-05-07T20:26:19.6452770Z 2025-05-07T20:26:19.6452779Z 2025-05-07T20:26:19.6452788Z 2025-05-07T20:26:19.6452796Z 2025-05-07T20:26:19.6452805Z 2025-05-07T20:26:19.6452811Z 2025-05-07T20:26:19.6452819Z 2025-05-07T20:26:19.6452826Z 2025-05-07T20:26:19.6452833Z 2025-05-07T20:26:19.6452838Z 2025-05-07T20:26:19.6454086Z 2025-05-07T20:26:19.6658502Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%  2025-05-07T20:26:19.6658891Z 2025-05-07T20:26:19.6658898Z 2025-05-07T20:26:19.6658903Z 2025-05-07T20:26:19.6658909Z 2025-05-07T20:26:19.6658914Z 2025-05-07T20:26:19.6658919Z 2025-05-07T20:26:19.6658925Z 2025-05-07T20:26:19.6658930Z 2025-05-07T20:26:19.6659560Z 2025-05-07T20:26:19.6659581Z 2025-05-07T20:26:19.6659587Z 2025-05-07T20:26:19.6659592Z 2025-05-07T20:26:19.6659598Z 2025-05-07T20:26:19.6659603Z 2025-05-07T20:26:19.6659607Z 2025-05-07T20:26:19.6659612Z 2025-05-07T20:26:19.6659617Z 2025-05-07T20:26:19.6659622Z 2025-05-07T20:26:19.6661108Z 2025-05-07T20:26:19.7305108Z ... (more hidden) ... 
2025-05-07T20:26:19.7305592Z 2025-05-07T20:26:19.7305597Z 2025-05-07T20:26:19.7305601Z 2025-05-07T20:26:19.7305605Z 2025-05-07T20:26:19.7305608Z 2025-05-07T20:26:19.7305613Z 2025-05-07T20:26:19.7305618Z 2025-05-07T20:26:19.7305621Z 2025-05-07T20:26:19.7305625Z 2025-05-07T20:26:19.7305629Z 2025-05-07T20:26:19.7305633Z 2025-05-07T20:26:19.7305636Z 2025-05-07T20:26:19.7305640Z 2025-05-07T20:26:19.7305644Z 2025-05-07T20:26:19.7305648Z 2025-05-07T20:26:19.7305651Z 2025-05-07T20:26:19.7305655Z 2025-05-07T20:26:19.7320363Z 2025-05-07T20:26:19.7659177Z libglib-2.84.0 | 3.8 MB | ########5 | 86%  2025-05-07T20:26:19.7659787Z 2025-05-07T20:26:19.7659792Z 2025-05-07T20:26:19.7659795Z 2025-05-07T20:26:19.7659799Z 2025-05-07T20:26:19.7659803Z 2025-05-07T20:26:19.7659806Z 2025-05-07T20:26:19.7659810Z 2025-05-07T20:26:19.7659814Z 2025-05-07T20:26:19.7659817Z 2025-05-07T20:26:19.7659821Z 2025-05-07T20:26:19.7659824Z 2025-05-07T20:26:19.7659828Z 2025-05-07T20:26:19.7659832Z 2025-05-07T20:26:19.7659835Z 2025-05-07T20:26:19.7659839Z 2025-05-07T20:26:19.7659843Z 2025-05-07T20:26:19.7659846Z 2025-05-07T20:26:19.7659850Z 2025-05-07T20:26:19.7659853Z 2025-05-07T20:26:19.8753423Z ... (more hidden) ... 2025-05-07T20:26:19.8753741Z 2025-05-07T20:26:19.8753746Z 2025-05-07T20:26:19.8753750Z 2025-05-07T20:26:19.8753754Z 2025-05-07T20:26:19.8753758Z 2025-05-07T20:26:19.8753763Z 2025-05-07T20:26:19.8753768Z 2025-05-07T20:26:19.8753773Z 2025-05-07T20:26:19.8754095Z 2025-05-07T20:26:19.8754118Z 2025-05-07T20:26:19.8754123Z 2025-05-07T20:26:19.8754128Z 2025-05-07T20:26:19.8754134Z 2025-05-07T20:26:19.8754139Z 2025-05-07T20:26:19.8754143Z 2025-05-07T20:26:19.8754149Z 2025-05-07T20:26:19.8754154Z 2025-05-07T20:26:19.8758653Z 2025-05-07T20:26:19.9036718Z libglib-2.84.0 | 3.8 MB | ########## | 100%  2025-05-07T20:26:19.9037062Z 2025-05-07T20:26:19.9037066Z 2025-05-07T20:26:19.9037070Z 2025-05-07T20:26:19.9037074Z 2025-05-07T20:26:19.9037077Z 2025-05-07T20:26:19.9037081Z 2025-05-07T20:26:19.9037085Z 2025-05-07T20:26:19.9037089Z 2025-05-07T20:26:19.9037092Z 2025-05-07T20:26:19.9037096Z 2025-05-07T20:26:19.9037100Z 2025-05-07T20:26:19.9037103Z 2025-05-07T20:26:19.9037107Z 2025-05-07T20:26:19.9037111Z 2025-05-07T20:26:19.9037115Z 2025-05-07T20:26:19.9037118Z 2025-05-07T20:26:19.9037128Z 2025-05-07T20:26:19.9037132Z 2025-05-07T20:26:19.9037136Z 2025-05-07T20:26:21.1363441Z ... (more hidden) ... 
2025-05-07T20:26:21.9644673Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:26:22.4272614Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:26:22.5731973Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:26:22.8076770Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:26:23.0564877Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:26:23.1923818Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:23.2386162Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:26:23.3981844Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:26:23.4990881Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:26:23.5854504Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:26:23.6638869Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:26:23.7070309Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:26:23.9060949Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:26:23.9472493Z libglib-2.84.0       | 3.8 MB   | ########## | 100%
2025-05-07T20:26:24.1298731Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:26:25.7375591Z ... (more hidden) ...
2025-05-07T20:26:30.1984766Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:26:30.1992307Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:30.2060368Z done
2025-05-07T20:26:30.4095251Z Preparing transaction: done
2025-05-07T20:26:31.5188387Z Verifying transaction: done
2025-05-07T20:26:32.1257232Z Executing transaction: done
2025-05-07T20:26:34.2974674Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:34.2975091Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:34.2975773Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:34.2989227Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
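[NOTE] A minimal sketch, not part of the original job, for sanity-checking the symlinks created above; the paths assume the same build_binary env prefix as the log:
  # confirm each unversioned .so is a symlink that resolves to the versioned library
  for lib in \
      /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so \
      /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so; do
    test -L "$lib" && readlink -f "$lib"
  done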
2025-05-07T20:26:34.3003115Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:34.3008044Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:34.4654369Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:34.4679803Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:34.5052952Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:36.3928665Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:36.4560400Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:36.8780893Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:36.9130407Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
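[NOTE] The `printenv` failure above is likely expected on the first pass: printenv exits non-zero when the variable is unset, and `conda env config vars set` only takes effect on the next activation of the env. A minimal sketch of the pattern used here (env name build_binary as in the log):
  # persist a variable inside the conda env itself
  conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  # inspect everything persisted for the env
  conda env config vars list -n build_binary
  # a fresh `conda run` re-activates the env and picks the value up
  conda run -n build_binary printenv LD_LIBRARY_PATH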
2025-05-07T20:26:37.3450120Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:37.3451024Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:39.7859802Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:41.8020353Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:43.8203979Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:43.8205104Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:45.8428408Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:47.7295654Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:47.7942021Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:51.6422949Z /tmp/tmp4nff3e7b: line 3: clang: command not found
2025-05-07T20:26:51.6423769Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:51.7067495Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:51.7088351Z total 36
2025-05-07T20:26:51.7088703Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:51.7089103Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:26:51.7089563Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:51.7090058Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:51.7090532Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:51.7090985Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:51.7091414Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:51.7091856Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:51.7092351Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:51.7092972Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:51.7115542Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:53.6637989Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:53.6638534Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:54.0897117Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:55.9770967Z -allow-unsupported-compiler
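[NOTE] NVCC_PREPEND_FLAGS is read by nvcc itself (CUDA 11.5+); its contents are prepended to every nvcc command line, so the setting above is equivalent to passing the flag explicitly. A sketch, with hello.cu as a hypothetical input file:
  # both invocations compile with the unsupported-host-compiler check relaxed
  NVCC_PREPEND_FLAGS='-allow-unsupported-compiler' nvcc -c hello.cu -o hello.o
  nvcc -allow-unsupported-compiler -c hello.cu -o hello.o
2025-05-07T20:26:56.0420825Z [INFO] Printing out all preprocessor defines in nvcc ...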
2025-05-07T20:26:56.0421311Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:56.0421633Z 2025-05-07T20:26:58.0047709Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:58.0048346Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:58.0048706Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:58.0049041Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:58.0049379Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0049662Z #define _STL_PAIR_H 1 2025-05-07T20:26:58.0049918Z #define __cpp_attributes 200809L 2025-05-07T20:26:58.0050263Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:58.0050657Z #define __DELETE_THROW throw() 2025-05-07T20:26:58.0050927Z #define _PTRDIFF_T_ 2025-05-07T20:26:58.0051186Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:58.0051475Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:58.0051744Z #define _IO_LEFT 02 2025-05-07T20:26:58.0051966Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:58.0052224Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:58.0052497Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:58.0052908Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:58.0053430Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0055581Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:58.0055837Z #define _IOS_OUTPUT 2 2025-05-07T20:26:58.0056170Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:58.0056536Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:58.0056840Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:58.0057110Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:58.0057385Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:58.0058149Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:58.0058938Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:58.0059530Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:58.0059838Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:58.0060151Z #define _T_WCHAR_ 2025-05-07T20:26:58.0060370Z #define stdout stdout 2025-05-07T20:26:58.0060707Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:58.0061082Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:58.0061336Z #define __flexarr [] 2025-05-07T20:26:58.0061577Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:58.0061901Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:58.0062239Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:58.0062492Z #define _MATH_H 1 2025-05-07T20:26:58.0062768Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:58.0063101Z #define __S64_TYPE long int 2025-05-07T20:26:58.0063355Z #define __stub_fchflags 2025-05-07T20:26:58.0063621Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:58.0063909Z #define __SQUAD_TYPE long int 2025-05-07T20:26:58.0064178Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:58.0064441Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:58.0064703Z #define NL_NMAX INT_MAX 2025-05-07T20:26:58.0065099Z #define _BITS_TIME_H 1 2025-05-07T20:26:58.0065385Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:58.0065720Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:58.0066058Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:58.0066407Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:58.0066803Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:58.0067156Z #define __CHAR_BIT__ 8 2025-05-07T20:26:58.0067422Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0067732Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:58.0068022Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:58.0068287Z #define FP_NAN 0 2025-05-07T20:26:58.0068548Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:58.0068981Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:58.0069461Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:58.0069850Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:58.0070139Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:58.0070396Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:58.0070650Z #define __SM_80_RT_H__ 2025-05-07T20:26:58.0070875Z #define _NEW 2025-05-07T20:26:58.0071096Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:58.0071374Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:58.0071738Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.0072129Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:58.0072372Z #define __USE_ANSI 1 2025-05-07T20:26:58.0072660Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:58.0073049Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:58.0073397Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:58.0073693Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:58.0073972Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:58.0074248Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:58.0074534Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:58.0074945Z #define PIPE_BUF 4096 2025-05-07T20:26:58.0075259Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:58.0075618Z #define ADJ_TICK 0x4000 2025-05-07T20:26:58.0075929Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:58.0076261Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:58.0076529Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:58.0076848Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:58.0077294Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.0077815Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:58.0078177Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:58.0078439Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:58.0078710Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0078996Z #define __cpp_static_assert 201411L 2025-05-07T20:26:58.0079337Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:58.0079683Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:58.0079968Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:58.0080246Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:58.0080542Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:58.0080827Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:58.0081129Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0081485Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:58.0081837Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:58.0082120Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:58.0082429Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0082787Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:58.0083139Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:58.0083428Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:58.0083726Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:58.0084137Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:58.0084466Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:58.0094930Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:58.0095412Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:58.0095763Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:58.0096056Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:58.0096347Z #define __GCC_IEC_559 2 2025-05-07T20:26:58.0096644Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:58.0096978Z #define _IO_flockfile(_fp) 2025-05-07T20:26:58.0097248Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:58.0097523Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0097783Z #define _IOFBF 0 2025-05-07T20:26:58.0098012Z #define __USE_BSD 1 2025-05-07T20:26:58.0098246Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:58.0098527Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:58.0098794Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:58.0099055Z #define _IO_NO_WRITES 8 2025-05-07T20:26:58.0099320Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:58.0099672Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:58.0100014Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:58.0100320Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:58.0100641Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:58.0100928Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:58.0101191Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:58.0101457Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0101764Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:58.0102139Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:58.0102500Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:58.0102805Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:58.0103105Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:58.0103432Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:58.0103859Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:58.0104149Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:58.0104424Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:58.0104691Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:58.0105271Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:58.0105876Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:58.0106223Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:58.0106539Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:58.0106832Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:58.0107111Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:58.0107380Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:58.0107673Z #define __ASSERT_VOID_CAST static_cast <void> 2025-05-07T20:26:58.0108005Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:58.0108313Z #define RAND_MAX 2147483647 2025-05-07T20:26:58.0108598Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:58.0108918Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0109236Z #define __SM_90_RT_H__ 2025-05-07T20:26:58.0109483Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:58.0109742Z #define __COMPAR_FN_T 2025-05-07T20:26:58.0109989Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0110255Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:58.0110724Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:58.0111230Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.0111575Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.0111927Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:58.0112226Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.0112563Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:58.0112874Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:58.0113463Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.0114005Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:58.0114337Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:58.0114607Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:58.0114906Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:58.0115208Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:58.0115470Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:58.0115738Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:58.0116030Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:58.0116301Z #define __u_char_defined 2025-05-07T20:26:58.0116620Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:58.0116977Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:58.0117235Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:58.0117491Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:58.0117774Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:58.0118205Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:58.0118621Z #define FP_INFINITE 1 2025-05-07T20:26:58.0118986Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0119395Z #define _IO_pid_t __pid_t 2025-05-07T20:26:58.0119644Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:58.0119900Z #define __LEAF , __leaf__ 2025-05-07T20:26:58.0120140Z #define PATH_MAX 4096 2025-05-07T20:26:58.0120387Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:58.0120716Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:58.0121029Z #define _LIMITS_H___ 2025-05-07T20:26:58.0121256Z #define __size_t 2025-05-07T20:26:58.0121556Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:58.0122134Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR |
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:58.0122696Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:58.0123005Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:58.0123429Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:58.0123695Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:58.0124048Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.0124444Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:58.0124744Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:58.0125343Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:58.0125718Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:58.0126010Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:58.0126294Z #define __INT8_C(c) c 2025-05-07T20:26:58.0126552Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:58.0126854Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:58.0127121Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:58.0127377Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:58.0127632Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:58.0127915Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:58.0128238Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0128564Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:58.0128843Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:58.0129112Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:58.0129378Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:58.0129693Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:58.0129998Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:58.0130350Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:58.0130727Z #define NFDBITS __NFDBITS 2025-05-07T20:26:58.0130989Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:58.0131275Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:58.0131604Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:58.0131930Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:58.0132188Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:58.0132597Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:58.0132913Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:58.0133328Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.0133748Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:58.0134116Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0134414Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:58.0134729Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:58.0135101Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:58.0135444Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:58.0135760Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:58.0136150Z #define __daddr_t_defined 2025-05-07T20:26:58.0136411Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.0136685Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:58.0137006Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:58.0137523Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:58.0138012Z #define _ACRTIMP 2025-05-07T20:26:58.0138236Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:58.0138510Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:58.0138807Z #define _IOS_BIN 128 2025-05-07T20:26:58.0139160Z #define __fortify_function __extern_always_inline 
__attribute_artificial__ 2025-05-07T20:26:58.0139580Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0139860Z #define UNDERFLOW 4 2025-05-07T20:26:58.0140086Z #define NAME_MAX 255 2025-05-07T20:26:58.0140335Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:58.0140614Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:58.0140896Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:58.0141199Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:58.0141576Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:58.0141958Z #define __ptr_t void * 2025-05-07T20:26:58.0142206Z #define M_E 2.7182818284590452354 2025-05-07T20:26:58.0142588Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:58.0142863Z #define __USE_ISOCXX11 1 2025-05-07T20:26:58.0143132Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:58.0143454Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:58.0143758Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:58.0144034Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:58.0144328Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:58.0144647Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:58.0144906Z #define __linux 1 2025-05-07T20:26:58.0145142Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:58.0145423Z #define cudaDeviceMask 0xff 2025-05-07T20:26:58.0145693Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:58.0146017Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:58.0146328Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:58.0146618Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:58.0146934Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:58.0147250Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:58.0147554Z #define _BITS_TYPES_H 1 2025-05-07T20:26:58.0147841Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:58.0148184Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:58.0148496Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:58.0148775Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:58.0149075Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:58.0149375Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:58.0150148Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:58.0150951Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:58.0151238Z #define __unix 1 2025-05-07T20:26:58.0151467Z #define MATH_ERRNO 1 2025-05-07T20:26:58.0151721Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:58.0152006Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:58.0152377Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:58.0152682Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:58.0152978Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0153271Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:58.0153737Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:58.0154197Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:58.0154500Z #define CUDARTAPI_CDECL 2025-05-07T20:26:58.0154766Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:58.0155047Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:58.0155337Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:58.0155606Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:58.0155851Z #define __SIZE_T 2025-05-07T20:26:58.0156104Z #define isgraph_l(c,l) 
__isgraph_l ((c), (l)) 2025-05-07T20:26:58.0156426Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:58.0156729Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:58.0156998Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:58.0157275Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:58.0157666Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:58.0158090Z #define __WAIT_STATUS void * 2025-05-07T20:26:58.0158363Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:58.0158643Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:58.0158912Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0159388Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:58.0159744Z #define __WINT_MIN__ 0U 2025-05-07T20:26:58.0160314Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:58.0160943Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:58.0161241Z #define WUNTRACED 2 2025-05-07T20:26:58.0161472Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:58.0161744Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:58.0162033Z #define NZERO 20 2025-05-07T20:26:58.0162460Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:58.0162728Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:58.0163015Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:58.0163300Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:58.0163545Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.0163826Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:58.0164098Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:58.0164370Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:58.0164631Z #define EXIT_FAILURE 1 2025-05-07T20:26:58.0164868Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:58.0165128Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:58.0165385Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:58.0165636Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:58.0165934Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:58.0166285Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:58.0166640Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:58.0166942Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:58.0167184Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:58.0167457Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:58.0167750Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:58.0168045Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:58.0168330Z #define SEEK_DATA 3 2025-05-07T20:26:58.0168560Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:58.0168851Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:58.0169256Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:58.0169636Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:58.0169880Z #define __INT64_C(c) c ## L 2025-05-07T20:26:58.0170141Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:58.0170467Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:58.0170785Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:58.0171052Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:58.0171470Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:58.0171774Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:58.0172017Z #define __INT_WCHAR_T_H 2025-05-07T20:26:58.0172248Z #define WSTOPPED 2 2025-05-07T20:26:58.0172481Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:58.0172753Z #define 
_POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:58.0172998Z #define FP_NORMAL 4 2025-05-07T20:26:58.0173296Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:58.0173570Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:58.0173794Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:58.0174043Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:58.0174312Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:58.0174573Z #define cudaTextureType1D 0x01 2025-05-07T20:26:58.0174834Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:58.0175085Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:58.0175345Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:58.0175630Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:58.0176095Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:58.0176539Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:58.0176795Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:58.0177043Z #define _POSIX_SOURCE 1 2025-05-07T20:26:58.0177278Z #define cudaTextureType2D 0x02 2025-05-07T20:26:58.0177530Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:58.0177791Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:58.0178097Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:58.0178356Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:58.0178666Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:58.0178991Z #define cudaTextureType3D 0x03 2025-05-07T20:26:58.0179248Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:58.0179501Z #define CLOCK_REALTIME 0 2025-05-07T20:26:58.0179751Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:58.0180013Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:58.0180309Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:58.0180582Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:58.0180937Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:58.0181217Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:58.0181479Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:58.0181772Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:58.0182057Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:58.0182330Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:58.0182574Z #define __GLIBC__ 2 2025-05-07T20:26:58.0182778Z #define __END_DECLS } 2025-05-07T20:26:58.0183013Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:58.0183366Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:58.0183724Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:58.0183973Z #define WCONTINUED 8 2025-05-07T20:26:58.0184196Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:58.0184440Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:58.0184704Z #define _ALLOCA_H 1 2025-05-07T20:26:58.0184940Z #define __host__ __location__(host) 2025-05-07T20:26:58.0185351Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.0185774Z #define __SLONG32_TYPE int 2025-05-07T20:26:58.0186034Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:58.0186304Z #define _SYS_SELECT_H 1 2025-05-07T20:26:58.0186535Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:58.0186777Z #define _IOS_NOCREATE 32 2025-05-07T20:26:58.0187021Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:58.0187287Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:58.0187570Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:58.0187845Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:58.0188117Z #define __global__ __location__(global) 2025-05-07T20:26:58.0188395Z #define 
__GNU_LIBRARY__ 6 2025-05-07T20:26:58.0188643Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:58.0188903Z #define __DBL_DIG__ 15 2025-05-07T20:26:58.0189130Z #define TIME_UTC 1 2025-05-07T20:26:58.0189427Z #define __FLT32_DIG__ 6 2025-05-07T20:26:58.0189742Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:58.0190127Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:58.0190550Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:58.0190871Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:58.0191158Z #define _G_BUFSIZ 8192 2025-05-07T20:26:58.0191457Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:58.0191819Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:58.0192134Z #define __cudaCDP2GetDevice 2025-05-07T20:26:58.0192463Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:58.0192744Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:58.0192981Z #define __GXX_WEAK__ 1 2025-05-07T20:26:58.0193230Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0193526Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:58.0193775Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:58.0194064Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:58.0194397Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:58.0194668Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:58.0194947Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:58.0195233Z #define _G_config_h 1 2025-05-07T20:26:58.0195502Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:58.0195845Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:58.0196149Z #define _GCC_WCHAR_T 2025-05-07T20:26:58.0196373Z #define TMP_MAX 238328 2025-05-07T20:26:58.0196600Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:58.0196862Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:58.0197119Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0197386Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:58.0197655Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:58.0197928Z #define _IO_SKIPWS 01 2025-05-07T20:26:58.0198314Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:58.0198755Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:58.0199140Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:58.0199458Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:58.0199816Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:58.0200175Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:58.0200530Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.0200777Z #define le32toh(x) (x) 2025-05-07T20:26:58.0201009Z #define _SIZE_T_DEFINED 2025-05-07T20:26:58.0201258Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:58.0201587Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:58.0201932Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:58.0202325Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:58.0202722Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:58.0202988Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:58.0203249Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:58.0203511Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:58.0203796Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:58.0204303Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:58.0204794Z #define 
_GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:58.0205096Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:58.0205439Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:58.0205751Z #define _WCHAR_T_ 2025-05-07T20:26:58.0205974Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:58.0206387Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:58.0206770Z #define RTSIG_MAX 32 2025-05-07T20:26:58.0206991Z #define _STDDEF_H 2025-05-07T20:26:58.0207221Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:58.0207490Z #define _VA_LIST_DEFINED 2025-05-07T20:26:58.0207741Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:58.0208066Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:58.0208536Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:58.0208869Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:58.0209153Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:58.0209608Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:58.0210128Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:58.0210489Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:58.0210808Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:58.0211118Z #define __unix__ 1 2025-05-07T20:26:58.0211355Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0211641Z #define __INT_WIDTH__ 32 2025-05-07T20:26:58.0211892Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:58.0212136Z #define _IONBF 2 2025-05-07T20:26:58.0212571Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:58.0213415Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:58.0261417Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:58.0261711Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:58.0261979Z #define __UINT16_C(c) c 2025-05-07T20:26:58.0262232Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:58.0262566Z #define STA_DEL 0x0020 2025-05-07T20:26:58.0262858Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:58.0263114Z #define __id_t_defined 2025-05-07T20:26:58.0263379Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:58.0263822Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:58.0264242Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:58.0264500Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:58.0264752Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:58.0264995Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:58.0265245Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:58.0265517Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:58.0266041Z #define SING 2 2025-05-07T20:26:58.0266249Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:58.0266509Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0266798Z #define cudaStreamDefault 0x00 2025-05-07T20:26:58.0267133Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:58.0267499Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:58.0267770Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:58.0268026Z #define __gnu_linux__ 1 2025-05-07T20:26:58.0268265Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:58.0268580Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:58.0268822Z #define MAX_INPUT 255 2025-05-07T20:26:58.0269058Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0269381Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:58.0269748Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:58.0270059Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:58.0270327Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:58.0270719Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:58.0271136Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:58.0271462Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:58.0271814Z #define _Mfloat_ float 2025-05-07T20:26:58.0272071Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:58.0272377Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.0272661Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:58.0273135Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:58.0273648Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0273934Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:58.0274266Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.0274623Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:58.0275061Z #define __USE_ISOC11 1 2025-05-07T20:26:58.0275305Z #define _BSD_SIZE_T_ 2025-05-07T20:26:58.0275542Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:58.0275797Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:58.0276056Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:58.0276358Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:58.0276674Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:58.0276987Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:58.0277317Z #define __THROW throw () 2025-05-07T20:26:58.0277568Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:58.0277858Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0278213Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.0278559Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:58.0278832Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:58.0279097Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:58.0279358Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:58.0279627Z #define L_tmpnam 20 2025-05-07T20:26:58.0279857Z #define ___int_wchar_t_h 2025-05-07T20:26:58.0280196Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:58.0280573Z #define isascii(c) __isascii (c) 2025-05-07T20:26:58.0280833Z #define _T_PTRDIFF 2025-05-07T20:26:58.0281143Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:58.0281493Z #define toascii(c) __toascii (c) 2025-05-07T20:26:58.0281751Z #define __GNUC__ 11 2025-05-07T20:26:58.0282006Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:58.0282298Z #define __GXX_RTTI 1 2025-05-07T20:26:58.0282523Z #define __pie__ 2 2025-05-07T20:26:58.0282736Z #define __MMX__ 1 2025-05-07T20:26:58.0282953Z #define __cudaCDP2Malloc 2025-05-07T20:26:58.0283209Z #define __timespec_defined 1 2025-05-07T20:26:58.0283471Z #define L_ctermid 9 2025-05-07T20:26:58.0283703Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.0284009Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:58.0284404Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:58.0284853Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:58.0285125Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:58.0285421Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:58.0285729Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:58.0286041Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:58.0286308Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:58.0286749Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:58.0287484Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:58.0288085Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:58.0288390Z #define __USE_SVID 1 2025-05-07T20:26:58.0288645Z #define __constant__ __location__(constant) 2025-05-07T20:26:58.0288960Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:58.0289265Z #define __device__ __location__(device) 2025-05-07T20:26:58.0289597Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:58.0289913Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:58.0290181Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:58.0290461Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:58.0290801Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:58.0291164Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:58.0291456Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:58.0291818Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:58.0292195Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:58.0292448Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:58.0292831Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:58.0293414Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:58.0293724Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:58.0294087Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:58.0294371Z #define NGROUPS_MAX 65536 2025-05-07T20:26:58.0294643Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:58.0294925Z #define __USE_ISOC95 1 2025-05-07T20:26:58.0295165Z #define _TIME_H 1 2025-05-07T20:26:58.0295512Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:58.0295846Z #define __USE_ISOC99 1 2025-05-07T20:26:58.0296199Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:58.0296553Z #define HOST_NAME_MAX 64 2025-05-07T20:26:58.0296798Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:58.0297053Z #define _IOS_ATEND 4 2025-05-07T20:26:58.0297282Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:58.0297598Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:58.0297996Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:58.0298327Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:58.0298601Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:58.0298919Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:58.0299224Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:58.0299473Z #define _STDIO_H 1 2025-05-07T20:26:58.0299872Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:58.0300324Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:58.0300677Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.0301047Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:58.0301324Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:58.0301586Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:58.0301850Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:58.0302136Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:58.0302426Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0302734Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:58.0302997Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:58.0303267Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:58.0303568Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:58.0303938Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:58.0304213Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:58.0304569Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:58.0304930Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:58.0305161Z #define __USE_XOPEN 1 2025-05-07T20:26:58.0305405Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:58.0305838Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.0306319Z #define __USE_XOPEN2K 1 2025-05-07T20:26:58.0306553Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:58.0306823Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:58.0307110Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:58.0307371Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:58.0307881Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:58.0308397Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:58.0308674Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:58.0309027Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:58.0309406Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0309773Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:58.0310161Z #define __END_NAMESPACE_C99 2025-05-07T20:26:58.0310427Z #define __glibcxx_integral_traps true 2025-05-07T20:26:58.0310706Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:58.0310949Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:58.0311202Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:58.0311463Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:58.0311704Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:58.0311989Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:58.0312282Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:58.0312637Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:58.0313092Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:58.0313368Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:58.0313617Z #define _IO_UNITBUF 020000 2025-05-07T20:26:58.0313862Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:58.0314126Z #define __FD_SETSIZE 1024 2025-05-07T20:26:58.0314367Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:58.0314638Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:58.0314977Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:58.0315326Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:58.0315579Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:58.0315887Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:58.0316200Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:58.0316461Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:58.0316770Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:58.0317098Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:58.0317374Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:58.0317700Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:58.0317987Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:58.0318248Z #define __USE_POSIX199506 1 2025-05-07T20:26:58.0318494Z #define _FEATURES_H 1 2025-05-07T20:26:58.0318727Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:58.0319115Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:58.0319515Z #define __stub_getmsg 2025-05-07T20:26:58.0319742Z #define _IO_FIXED 010000 2025-05-07T20:26:58.0320017Z #define __cpp_lib_addressof_constexpr 201603 
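The __glibcxx_digits_b, __glibcxx_max_b, and __glibcxx_digits10_b helpers dumped above are the bit arithmetic behind libstdc++'s numeric_limits. A minimal C sketch of the same formulas, assuming __glibcxx_signed_b(T,B) expands to ((T)(-1) < 0) — its definition does not appear in this dump:

    #include <stdio.h>

    #define signed_b(T)      ((T)(-1) < 0)                    /* assumed definition */
    #define digits_b(T, B)   ((B) - signed_b(T))              /* value bits */
    #define max_b(T, B)      (signed_b(T) \
        ? (((((T)1 << (digits_b(T, B) - 1)) - 1) << 1) + 1)   /* builds 2^(B-1)-1 without overflow */ \
        : ~(T)0)
    #define digits10_b(T, B) (digits_b(T, B) * 643L / 2136)   /* 643/2136 ~= log10(2) */

    int main(void) {
        /* For a 32-bit signed int: 31 value bits, max 2147483647, 9 decimal digits. */
        printf("digits=%d max=%d digits10=%ld\n",
               digits_b(int, 32), max_b(int, 32), digits10_b(int, 32));
        return 0;
    }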
2025-05-07T20:26:58.0320318Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:58.0320581Z #define __stub_setlogin 2025-05-07T20:26:58.0320817Z #define __stub_fattach 2025-05-07T20:26:58.0321046Z #define __cplusplus 201703L 2025-05-07T20:26:58.0321307Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:58.0321582Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:58.0321828Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:58.0322104Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:58.0322666Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:58.0323174Z #define _IO_INTERNAL 010 2025-05-07T20:26:58.0323414Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:58.0323741Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:58.0324093Z #define __dev_t_defined 2025-05-07T20:26:58.0324324Z #define __DEPRECATED 1 2025-05-07T20:26:58.0324549Z #define __S32_TYPE int 2025-05-07T20:26:58.0324796Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:58.0325081Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:58.0325332Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:58.0325581Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:58.0326169Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:58.0326796Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:58.0327106Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.0327437Z #define OVERFLOW 3 2025-05-07T20:26:58.0327690Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:58.0328086Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:58.0328374Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0328699Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:58.0329022Z #define __SSE2_MATH__ 1 2025-05-07T20:26:58.0329261Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:58.0329557Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0329855Z #define _IO_STDIO_H 2025-05-07T20:26:58.0330102Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:58.0330412Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:58.0330759Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:58.0331082Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0331418Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:58.0331707Z #define __amd64 1 2025-05-07T20:26:58.0331944Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:58.0332343Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:58.0332628Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:58.0332913Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:58.0333297Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:58.0333552Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:58.0333844Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:58.0334105Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:58.0334346Z #define __bounded 2025-05-07T20:26:58.0334575Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0334857Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:58.0335128Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:58.0335387Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:58.0335654Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0335985Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:58.0336412Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:58.0336813Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:58.0337086Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:58.0337414Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:58.0337746Z #define STA_PLL 0x0001 2025-05-07T20:26:58.0337986Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:58.0338242Z #define __GNUG__ 11 2025-05-07T20:26:58.0338466Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:58.0338723Z #define _T_WCHAR 2025-05-07T20:26:58.0338950Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:58.0339245Z #define __specialization_static 2025-05-07T20:26:58.0339540Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:58.0339838Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:58.0340091Z #define cudaArraySparse 0x40 2025-05-07T20:26:58.0340353Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:58.0340592Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:58.0340867Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:58.0341163Z #define _WCHAR_T 2025-05-07T20:26:58.0341375Z #define __cudaCDP2Free 2025-05-07T20:26:58.0342095Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:58.0342761Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:58.0343167Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:58.0343591Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0343860Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:58.0344116Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:58.0344439Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:58.0344787Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:58.0345025Z #define __NO_CTYPE 1 2025-05-07T20:26:58.0345246Z #define __stub_bdflush 2025-05-07T20:26:58.0345593Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:58.0346038Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:58.0346369Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:58.0346629Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:58.0346896Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:58.0347191Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:58.0347477Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:58.0347808Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:58.0348142Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:58.0348416Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:58.0348694Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:58.0349027Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:58.0349366Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:58.0349635Z #define _IO_STDIO 040000 2025-05-07T20:26:58.0349952Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:58.0350325Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:58.0350716Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:58.0351012Z #define _PTRDIFF_T 2025-05-07T20:26:58.0351229Z #define _MOVE_H 1 2025-05-07T20:26:58.0351448Z #define __cpp_hex_float 201603L 2025-05-07T20:26:58.0351704Z #define ADJ_TAI 0x0080 2025-05-07T20:26:58.0351929Z #define __ptrvalue 2025-05-07T20:26:58.0352145Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:58.0352393Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:58.0352672Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:58.0352964Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:58.0353218Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:58.0353503Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:58.0353892Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:58.0354265Z #define __USE_GNU 1 2025-05-07T20:26:58.0354494Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:58.0354766Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:58.0355031Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:58.0355428Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:58.0355839Z #define WEXITED 4 2025-05-07T20:26:58.0356072Z #define _IO_NO_READS 4 2025-05-07T20:26:58.0356374Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:58.0356722Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:58.0356995Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:58.0357287Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:58.0357598Z #define __uid_t_defined 2025-05-07T20:26:58.0357848Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:58.0358140Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:58.0358417Z #define WNOHANG 1 2025-05-07T20:26:58.0358664Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:58.0358969Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:58.0359457Z #define cudaEventDefault 0x00 2025-05-07T20:26:58.0359868Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:58.0360210Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:58.0360589Z #define __x86_64 1 2025-05-07T20:26:58.0360819Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:58.0361203Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0361671Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:58.0362164Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:58.0362596Z #define __PTRDIFF_T 2025-05-07T20:26:58.0362911Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:58.0363281Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:58.0363548Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0363826Z #define _Mlong_double_ long double 2025-05-07T20:26:58.0364106Z #define __cpp_lambdas 200907L 2025-05-07T20:26:58.0364353Z #define _IO_DEC 020 2025-05-07T20:26:58.0364567Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:58.0364838Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:58.0365135Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:58.0365403Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:58.0365665Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:58.0365962Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:58.0366277Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:58.0366543Z #define _ANSI_STDDEF_H 2025-05-07T20:26:58.0366807Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:58.0367114Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:58.0367475Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:58.0367847Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:58.0368123Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:58.0368403Z #define __cpp_template_auto 201606L 2025-05-07T20:26:58.0368755Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:58.0369118Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:58.0369562Z #define 
__key_t_defined 2025-05-07T20:26:58.0369821Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:58.0370192Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:58.0370649Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:58.0371006Z #define __GNUC_VA_LIST 2025-05-07T20:26:58.0371336Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:58.0371709Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:58.0371967Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:58.0372249Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:58.0372535Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:58.0372779Z #define __WCOREFLAG 0x80 2025-05-07T20:26:58.0373110Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:58.0373413Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:58.0373682Z #define __LP64__ 1 2025-05-07T20:26:58.0373921Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:58.0374241Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:58.0374519Z #define _IO_off64_t __off64_t 2025-05-07T20:26:58.0374777Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0375031Z #define __time_t_defined 1 2025-05-07T20:26:58.0375280Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:58.0382906Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:58.0383330Z #define __USE_UNIX98 1 2025-05-07T20:26:58.0383578Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0383850Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:58.0384118Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:58.0384415Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:58.0384723Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:58.0384972Z #define SEEK_CUR 1 2025-05-07T20:26:58.0385204Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0385472Z #define _ASSERT_H 1 2025-05-07T20:26:58.0386055Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:58.0386808Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:58.0387079Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:58.0387325Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:58.0387581Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:58.0387848Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:58.0388211Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:58.0388605Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:58.0389249Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:58.0389885Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:58.0390184Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:58.0390525Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:58.0390891Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:58.0391163Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.0391435Z #define cudaArrayDefault 0x00 2025-05-07T20:26:58.0391714Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:58.0391996Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:58.0392264Z #define TLOSS 5 2025-05-07T20:26:58.0392476Z #define __ssize_t_defined 2025-05-07T20:26:58.0392724Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:58.0392991Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:58.0393272Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:58.0393558Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:58.0393911Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:58.0394287Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:58.0394564Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:58.0394842Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:58.0395144Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:58.0395540Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:58.0395828Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:58.0396171Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:58.0396551Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:58.0396899Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:58.0397128Z #define __cdecl 2025-05-07T20:26:58.0397360Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:58.0397678Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:58.0398081Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:58.0398328Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:58.0398583Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:58.0398872Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:58.0399128Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:58.0399424Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:58.0399750Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:58.0400156Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:58.0400576Z #define ADJ_NANO 0x2000 2025-05-07T20:26:58.0400876Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:58.0401223Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:58.0401500Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:58.0401750Z #define __FLT_DIG__ 6 2025-05-07T20:26:58.0402094Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:58.0402482Z #define __NO_INLINE__ 1 2025-05-07T20:26:58.0402774Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:58.0403115Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:58.0403369Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:58.0403627Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:58.0403912Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:58.0404173Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:58.0404461Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:58.0404749Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:58.0405227Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
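__NV_GLIBCXX_VERSION just above, and __CUDART_API_VERSION earlier in the dump, pack (major, minor, patch) into a single comparable integer. A sketch of the arithmetic, assuming GCC 11.4.0 (per __VERSION__ "11.4.0" below) and a CUDA runtime minor of 6 (per __CUDACC_VER_MINOR__; the __CUDA_API_VER_MINOR__ value itself is not shown here):

    #include <stdio.h>

    int main(void) {
        /* __NV_GLIBCXX_VERSION = __GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__ */
        int nv_glibcxx = 11 * 10000 + 4 * 100 + 0;   /* -> 110400 */
        /* __CUDART_API_VERSION = major * 1000 + minor * 10 */
        int cudart_api = 12 * 1000 + 6 * 10;         /* -> 12060 */
        printf("%d %d\n", nv_glibcxx, cudart_api);
        return 0;
    }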
2025-05-07T20:26:58.0405636Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:58.0405972Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:58.0406310Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:58.0406547Z #define MAX_CANON 255 2025-05-07T20:26:58.0406772Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:58.0407019Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:58.0407282Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:58.0407556Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:58.0407859Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:58.0408150Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:58.0408420Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:58.0408735Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:58.0409039Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:58.0409295Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:58.0409595Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:58.0409880Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:58.0410155Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:58.0410455Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:58.0410742Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:58.0410994Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:58.0411237Z #define _SYS_TYPES_H 1 2025-05-07T20:26:58.0411471Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:58.0411728Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:58.0411970Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:58.0412201Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:58.0412469Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:58.0412753Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:58.0413003Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:58.0413366Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:58.0413628Z #define FP_SUBNORMAL 3 2025-05-07T20:26:58.0413961Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:58.0414246Z #define _INITIALIZER_LIST 2025-05-07T20:26:58.0414489Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:58.0414734Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:58.0415005Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:58.0415287Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:58.0415534Z #define _IO_file_flags _flags 2025-05-07T20:26:58.0415797Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:58.0416086Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:58.0416357Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:58.0416624Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:58.0416883Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:58.0417249Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:58.0417636Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:58.0417942Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:58.0418201Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:58.0418455Z #define _BSD_SOURCE 1 2025-05-07T20:26:58.0418696Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:58.0419526Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:58.0420346Z #define __catch(X) catch(X) 2025-05-07T20:26:58.0420606Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:58.0420897Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:58.0421161Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:58.0421412Z #define __STRING(x) #x 2025-05-07T20:26:58.0421655Z #define
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:58.0421921Z #define _T_PTRDIFF_ 2025-05-07T20:26:58.0422164Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:58.0422463Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:58.0422734Z #define __unbounded 2025-05-07T20:26:58.0422973Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0423267Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:58.0423629Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0423920Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:58.0424197Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:58.0424493Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:58.0424807Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:58.0425108Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:58.0425387Z #define __managed__ __location__(managed) 2025-05-07T20:26:58.0425673Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:58.0426111Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.0426517Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:58.0426772Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:58.0427135Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:58.0427525Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:58.0427777Z #define _SYS_SIZE_T_H 2025-05-07T20:26:58.0428065Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:58.0428403Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:58.0428679Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:58.0428967Z #define _CRTIMP 2025-05-07T20:26:58.0429188Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:58.0429489Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.0429804Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:58.0430152Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:58.0430549Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0430859Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:58.0431132Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:58.0431411Z #define __SIZE_T__ 2025-05-07T20:26:58.0431624Z #define __stub_gtty 2025-05-07T20:26:58.0431847Z #define __pid_t_defined 2025-05-07T20:26:58.0432100Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:58.0432488Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0432802Z #define __glibcxx_function_requires(...) 
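__GNUC_PREREQ just above (and __GLIBC_PREREQ earlier in the dump) compare versions by packing the major number into the high bits, (major << 16) + minor, so a single integer compare orders (major, minor) pairs. A self-contained C sketch of that comparison:

    #include <stdio.h>

    /* Same shape as ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)). */
    static int prereq(int cur_maj, int cur_min, int maj, int min) {
        return (cur_maj << 16) + cur_min >= (maj << 16) + min;
    }

    int main(void) {
        printf("%d\n", prereq(11, 4, 4, 8));   /* 11.4 >= 4.8  -> 1 */
        printf("%d\n", prereq(11, 4, 12, 0));  /* 11.4 >= 12.0 -> 0 */
        return 0;
    }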
2025-05-07T20:26:58.0433092Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:58.0433331Z #define __need_clockid_t 2025-05-07T20:26:58.0433603Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:58.0433862Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:58.0434171Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:58.0434480Z #define _IO_HEX 0100 2025-05-07T20:26:58.0434733Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:58.0435062Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:58.0435361Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:58.0435625Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:58.0436023Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0436449Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:58.0436753Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:58.0437045Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:58.0437336Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:58.0437615Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:58.0437869Z #define __stub_sstk 2025-05-07T20:26:58.0438096Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:58.0438400Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:58.0438723Z #define __wur 2025-05-07T20:26:58.0438959Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:58.0439249Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:58.0439471Z #define _IO_OCT 040 2025-05-07T20:26:58.0439691Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:58.0439949Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:58.0440192Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:58.0440470Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:58.0440775Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:58.0441029Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:58.0441388Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:58.0441765Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:58.0442101Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:58.0442356Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:58.0442633Z #define __off64_t_defined 2025-05-07T20:26:58.0442885Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:58.0443142Z #define __FLT128_DIG__ 33 2025-05-07T20:26:58.0443392Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:58.0443678Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:58.0443930Z #define __INT32_C(c) c 2025-05-07T20:26:58.0444161Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:58.0444426Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:58.0444692Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:58.0444955Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:58.0445193Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:58.0445436Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:58.0445730Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:58.0446051Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:58.0446304Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:58.0446545Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:58.0446820Z #define __have_pthread_attr_t 1 2025-05-07T20:26:58.0447082Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:58.0447470Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:58.0447894Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:58.0448183Z #define __cudaCDP2EventRecord 2025-05-07T20:26:58.0448441Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:58.0448679Z #define 
htole32(x) (x) 2025-05-07T20:26:58.0449066Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:58.0449523Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:58.0449824Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:58.0450160Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:58.0450539Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:58.0450967Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:58.0451324Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:58.0451637Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:58.0451884Z #define cudaArrayLayered 0x01 2025-05-07T20:26:58.0452210Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:58.0452576Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:58.0452859Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:58.0453201Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:58.0453471Z #define unix 1 2025-05-07T20:26:58.0453683Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:58.0453928Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:58.0454179Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:58.0454454Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:58.0454744Z #define __USE_POSIX 1 2025-05-07T20:26:58.0454978Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:58.0455268Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:58.0455579Z #define __THROWNL throw () 2025-05-07T20:26:58.0455824Z #define __cpp_rtti 199711L 2025-05-07T20:26:58.0456075Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:58.0456352Z #define __PMT(args) args 2025-05-07T20:26:58.0456606Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0456950Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:58.0457294Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:58.0457588Z #define _SIZE_T_DECLARED 2025-05-07T20:26:58.0457833Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:58.0458086Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:58.0458623Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:58.0459453Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:58.0459747Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:58.0459999Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:58.0460309Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:58.0460784Z #define _WCHAR_T_H 2025-05-07T20:26:58.0460999Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:58.0461233Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:58.0461465Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:58.0461705Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:58.0461965Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:58.0462214Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:58.0462466Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:58.0462735Z #define __ELF__ 1 2025-05-07T20:26:58.0462955Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:58.0463227Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:58.0463481Z #define STA_INS 0x0010 2025-05-07T20:26:58.0463715Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:58.0464063Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:58.0464408Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:58.0464654Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0464924Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:58.0465235Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0465523Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:58.0465800Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:58.0466086Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:58.0466407Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.0466799Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:58.0467135Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:58.0467609Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:58.0468151Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:58.0468462Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:58.0468699Z #define __FLT_RADIX__ 2 2025-05-07T20:26:58.0468944Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:58.0469294Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:58.0469783Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:58.0470043Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:58.0470306Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:58.0470587Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:58.0470842Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:58.0471112Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:58.0471378Z #define WORD_BIT 32 2025-05-07T20:26:58.0471589Z #define _IO_USER_BUF 1 2025-05-07T20:26:58.0471817Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:58.0472070Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0472359Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:58.0472639Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:58.0472903Z #define __long_double_t long double 2025-05-07T20:26:58.0473171Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:58.0473425Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:58.0473978Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:58.0474551Z #define __k8 1 2025-05-07T20:26:58.0474857Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:58.0475300Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:58.0475669Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:58.0475965Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:58.0476242Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:58.0476519Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:58.0476786Z #define __blksize_t_defined 2025-05-07T20:26:58.0477037Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:58.0477285Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:58.0477569Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:58.0477857Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:58.0478124Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:58.0478409Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:58.0478661Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:58.0479081Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:58.0479843Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:58.0480365Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:58.0480641Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:58.0480887Z #define SEEK_SET 0 2025-05-07T20:26:58.0481108Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:58.0481373Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:58.0481725Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:58.0482099Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:58.0482377Z #define __cudaCDP2GetLastError 2025-05-07T20:26:58.0482638Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:58.0482884Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:58.0483343Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:58.0483840Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:58.0484101Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:58.0484349Z #define __stub_sigreturn 2025-05-07T20:26:58.0484727Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:58.0485148Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:58.0485406Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:58.0485648Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:58.0485901Z #define CLOCK_TAI 11 2025-05-07T20:26:58.0486135Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:58.0486408Z #define __restrict_arr 2025-05-07T20:26:58.0486658Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:58.0486991Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:58.0487806Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:58.0488577Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:58.0488930Z #define __USE_MISC 1 2025-05-07T20:26:58.0489168Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:58.0489450Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:58.0489697Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:58.0489923Z #define __LDBL_DIG__ 18 2025-05-07T20:26:58.0490151Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:58.0490415Z #define __malloc_and_calloc_defined 2025-05-07T20:26:58.0490687Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:58.0490948Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:58.0491210Z #define __x86_64__ 1 2025-05-07T20:26:58.0491419Z #define _SIZE_T_ 2025-05-07T20:26:58.0492406Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:58.0493537Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:58.0493819Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0494106Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:58.0494420Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:58.0494716Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:58.0494985Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:58.0495294Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:58.0495637Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:58.0495961Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:58.0496578Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:58.0497248Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:58.0497689Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:58.0498022Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:58.0498284Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:58.0498577Z #define STA_FLL 0x0008 2025-05-07T20:26:58.0498930Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:58.0499254Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:58.0499539Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0499857Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:58.0500138Z #define __stub_revoke 2025-05-07T20:26:58.0500365Z #define __timer_t_defined 1 2025-05-07T20:26:58.0500646Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:58.0501121Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:58.0501536Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:58.0506392Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:58.0506706Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:58.0507012Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:58.0507300Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:58.0507620Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:58.0507945Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:58.0508197Z #define _IO_off_t __off_t 2025-05-07T20:26:58.0508435Z #define __FLT64_DIG__ 15 2025-05-07T20:26:58.0508803Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:58.0509205Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:58.0509492Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0509835Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:58.0510133Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:58.0510397Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:58.0510658Z #define NULL __null 2025-05-07T20:26:58.0510912Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:58.0511351Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:58.0511639Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:58.0511906Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0512164Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:58.0512404Z #define FP_ZERO 2 2025-05-07T20:26:58.0512621Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:58.0512931Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:58.0513278Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0513556Z #define __WCHAR_T__ 2025-05-07T20:26:58.0513775Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:58.0514127Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:58.0514571Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:58.0514907Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:58.0515189Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:58.0515506Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0515835Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:58.0516177Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:58.0516481Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:58.0516721Z #define _SIGSET_H_types 1 2025-05-07T20:26:58.0516983Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:58.0517279Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:58.0517608Z #define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 
2025-05-07T20:26:58.0517943Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:58.0518244Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:58.0518581Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:58.0518903Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:58.0519220Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:58.0519604Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:58.0519962Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:58.0520230Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:58.0520599Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:58.0520864Z #define STA_MODE 0x4000 2025-05-07T20:26:58.0521112Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:58.0521404Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:58.0521695Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:58.0521999Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:58.0522272Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:58.0522649Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:58.0522952Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:58.0523223Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:58.0523509Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:58.0523776Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0524058Z #define __SEG_FS 1 2025-05-07T20:26:58.0524275Z #define _IO_size_t size_t 2025-05-07T20:26:58.0524522Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:58.0524795Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:58.0525052Z #define __stub_lchmod 2025-05-07T20:26:58.0525293Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:58.0525564Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0525851Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:58.0526107Z #define __SEG_GS 1 2025-05-07T20:26:58.0526412Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:58.0526769Z #define _IOS_APPEND 8 2025-05-07T20:26:58.0527001Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:58.0527257Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:58.0527506Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:58.0527778Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:58.0528043Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:58.0528305Z #define htole16(x) (x) 2025-05-07T20:26:58.0528549Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:58.0528836Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:58.0529089Z #define __INT16_TYPE__ short int 2025-05-07T20:26:58.0529448Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:58.0529756Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:58.0530055Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:58.0530368Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:58.0530662Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:58.0530898Z #define __WCLONE 0x80000000 2025-05-07T20:26:58.0531140Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:58.0531380Z #define SEEK_HOLE 4 2025-05-07T20:26:58.0531591Z #define TIMER_ABSTIME 1 2025-05-07T20:26:58.0531820Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:58.0532068Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:58.0532395Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.0532767Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0533155Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:58.0533430Z #define 
__cpp_sized_deallocation 201309L 2025-05-07T20:26:58.0533723Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0534014Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:58.0534319Z #define _LINUX_LIMITS_H 2025-05-07T20:26:58.0534547Z #define linux 1 2025-05-07T20:26:58.0534755Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:58.0535020Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:58.0535314Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:58.0535572Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:58.0535838Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:58.0536169Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:58.0536490Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:58.0536749Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0537018Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:58.0537272Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:58.0537503Z #define htole64(x) (x) 2025-05-07T20:26:58.0537737Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:58.0538041Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:58.0538355Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:58.0539090Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:58.0539751Z #define __USE_POSIX2 1 2025-05-07T20:26:58.0539976Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:58.0540225Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:58.0540485Z #define __WALL 0x40000000 2025-05-07T20:26:58.0540722Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:58.0540967Z #define _XLOCALE_H 1 2025-05-07T20:26:58.0541202Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:58.0541457Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.0541722Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0541986Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:58.0542263Z #define __EXCEPTIONS 1 2025-05-07T20:26:58.0542504Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:58.0542868Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:58.0543238Z #define __WORDSIZE 64 2025-05-07T20:26:58.0543467Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:58.0543700Z #define _STL_RELOPS_H 1 2025-05-07T20:26:58.0543933Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:58.0544185Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:58.0544454Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:58.0544725Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:58.0544976Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:58.0545434Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:58.0546048Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:58.0546492Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:58.0546789Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:58.0547061Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:58.0547361Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:58.0547654Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:58.0548020Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:58.0548408Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:58.0548770Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:58.0549030Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:58.0549291Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:58.0549650Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:58.0550023Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:58.0550302Z #define _STRING_H 1 2025-05-07T20:26:58.0550523Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:58.0550779Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:58.0551033Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:58.0551338Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:58.0551651Z #define __code_model_small__ 1 2025-05-07T20:26:58.0551899Z #define _PSTL_CONFIG_H 2025-05-07T20:26:58.0552140Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:58.0552439Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:58.0552735Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:58.0552997Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:58.0553503Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:58.0554016Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:58.0554261Z #define le64toh(x) (x) 2025-05-07T20:26:58.0554483Z #define FILENAME_MAX 4096 2025-05-07T20:26:58.0554777Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:58.0555120Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:58.0555402Z #define L_cuserid 9 2025-05-07T20:26:58.0555611Z #define __ino_t_defined 2025-05-07T20:26:58.0555845Z #define __k8__ 1 2025-05-07T20:26:58.0556060Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:58.0556332Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:58.0556614Z #define __int8_t_defined 2025-05-07T20:26:58.0556856Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:58.0557219Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0557509Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:58.0557800Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:58.0558054Z #define _IOS_TRUNC 16 2025-05-07T20:26:58.0558306Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:58.0558652Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:58.0558966Z #define __HAVE_COLUMN 2025-05-07T20:26:58.0559384Z #define __stub_fdetach 2025-05-07T20:26:58.0559986Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:58.0560561Z #define __pic__ 2 2025-05-07T20:26:58.0560807Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0561117Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:58.0561381Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:58.0561646Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:58.0561923Z #define __stub_chflags 2025-05-07T20:26:58.0562178Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:58.0562417Z #define __need_IOV_MAX 2025-05-07T20:26:58.0562666Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:58.0562975Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:58.0563266Z #define __cpp_decltype 200707L 2025-05-07T20:26:58.0563534Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:58.0563809Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:58.0564073Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:58.0564364Z #define TTY_NAME_MAX 32 2025-05-07T20:26:58.0564673Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:58.0565051Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0565429Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.0565798Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:58.0566095Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:58.0566512Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:58.0566762Z #define __import__ 2025-05-07T20:26:58.0566984Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:58.0567272Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:58.0567578Z #define __export__ 2025-05-07T20:26:58.0567833Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:58.0568147Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:58.0568486Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.0568835Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:58.0569093Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:58.0569348Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:58.0569610Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:58.0569886Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:58.0570218Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:58.0570531Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:58.0570816Z #define WNOWAIT 0x01000000 2025-05-07T20:26:58.0571062Z #define PLOSS 6 2025-05-07T20:26:58.0571285Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:58.0571728Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:58.0572175Z #define EXIT_SUCCESS 0 2025-05-07T20:26:58.0572419Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:58.0572686Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:58.0572957Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:58.0573326Z #define __thread__ __thread 2025-05-07T20:26:58.0573569Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:58.0573828Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:58.0574091Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:58.0574495Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:58.0574922Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:58.0575220Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:58.0575459Z #define __linux__ 1 2025-05-07T20:26:58.0575679Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:58.0575968Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:58.0576407Z #define __S16_TYPE short int 2025-05-07T20:26:58.0576905Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:58.0577440Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:58.0577812Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:58.0578182Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:58.0578446Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:58.0578704Z #define _T_SIZE_ 2025-05-07T20:26:58.0578923Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:58.0579214Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:58.0579513Z #define _PSTL_VERSION 12000 2025-05-07T20:26:58.0579781Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:58.0580086Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:58.0580343Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:58.0580643Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:58.0580941Z #define _IOS_INPUT 1 2025-05-07T20:26:58.0581167Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:58.0581424Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:58.0581708Z #define __INT64_TYPE__ long int 2025-05-07T20:26:58.0581970Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:58.0582230Z #define __shared__ __location__(shared) 2025-05-07T20:26:58.0582504Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:58.0582809Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:58.0583146Z #define __gid_t_defined 2025-05-07T20:26:58.0583402Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:58.0583698Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:58.0584065Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.0584456Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:58.0584712Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:58.0585035Z #define ___int_size_t_h 2025-05-07T20:26:58.0585297Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0585608Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:58.0585969Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:58.0586311Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:58.0586582Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:58.0586849Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:58.0587115Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:58.0587395Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0587715Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:58.0588029Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:58.0588323Z #define __clock_t_defined 1 2025-05-07T20:26:58.0588572Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:58.0588860Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:58.0589140Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:58.0589384Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:58.0589646Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:58.0589933Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:58.0590216Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:58.0590541Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.0590890Z #define __SSE__ 1 2025-05-07T20:26:58.0591106Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:58.0591373Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:58.0591641Z #define _CTYPE_H 1 2025-05-07T20:26:58.0591854Z #define __sigset_t_defined 2025-05-07T20:26:58.0592109Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0592372Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:58.0592613Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:58.0592849Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:58.0593116Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:58.0593358Z #define __SM_70_RT_H__ 2025-05-07T20:26:58.0593587Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:58.0593858Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:58.0594222Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:58.0594536Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.0594883Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:58.0595150Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:58.0595243Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:58.0595332Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:58.0595413Z #define __amd64__ 1 2025-05-07T20:26:58.0595499Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:58.0595603Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:58.0595865Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0595964Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:58.0596046Z #define EOF (-1) 2025-05-07T20:26:58.0596141Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:58.0596233Z #define __USE_POSIX199309 1 2025-05-07T20:26:58.0596334Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:58.0596436Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:58.0596530Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:58.0596629Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:58.0596741Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:58.0596831Z #define ____mbstate_t_defined 1 2025-05-07T20:26:58.0596922Z #define STA_NANO 0x2000 2025-05-07T20:26:58.0597015Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:58.0597106Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:58.0597193Z #define _IO_LINKED 0x80 2025-05-07T20:26:58.0597289Z #define __cpp_lib_launder 201606 2025-05-07T20:26:58.0597382Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:58.0597488Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:58.0597581Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:58.0597679Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:58.0597821Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:58.0597927Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0598116Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.0598216Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0598308Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:58.0598401Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:58.0598530Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:58.0598652Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0598852Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:58.0599032Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:58.0599121Z #define __stub_stty 2025-05-07T20:26:58.0599286Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:58.0599371Z #define le16toh(x) (x) 2025-05-07T20:26:58.0599479Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:58.0599649Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:58.0599729Z #define _SIZET_ 2025-05-07T20:26:58.0599826Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:58.0599916Z #define _SVID_SOURCE 1 2025-05-07T20:26:58.0599996Z #define _LP64 1 2025-05-07T20:26:58.0600092Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:58.0600320Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:58.0600430Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:58.0600513Z #define __UINT8_C(c) c 2025-05-07T20:26:58.0600606Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:58.0600700Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:58.0600809Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:58.0600901Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:58.0600993Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:58.0601088Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:58.0601171Z #define CUDARTAPI 2025-05-07T20:26:58.0601255Z #define IOV_MAX 1024 2025-05-07T20:26:58.0601398Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:58.0601495Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:58.0601601Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:58.0601816Z #define __wchar_t__ 2025-05-07T20:26:58.0601971Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:58.0602070Z #define SEEK_END 2 2025-05-07T20:26:58.0602164Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:58.0602337Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:58.0602433Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:58.0602575Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:58.0602667Z #define ____FILE_defined 1 2025-05-07T20:26:58.0602779Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:58.0602874Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:58.0602965Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:58.0603057Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:58.0603301Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0603428Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:58.0603516Z #define _IO_RIGHT 04 2025-05-07T20:26:58.0603663Z #define __END_NAMESPACE_STD 2025-05-07T20:26:58.0603887Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.0603982Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:58.0604107Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:58.0604208Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:58.0604312Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:58.0604401Z #define _STDDEF_H_ 2025-05-07T20:26:58.0604572Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.0604675Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0604794Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:58.0604990Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:58.0605109Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0605253Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:58.0605507Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:58.0605626Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:58.0605741Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:58.0605838Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0605973Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:58.0606079Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:58.0606203Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:58.0606302Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:58.0606476Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:58.0606574Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:58.0606673Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:58.0606769Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:58.0606918Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:58.0607015Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:58.0607109Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:58.0607219Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:58.0607320Z #define P_tmpdir "/tmp" 2025-05-07T20:26:58.0607441Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:58.0607540Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:58.0607644Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:58.0607812Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.0607984Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:58.0608084Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:58.0608206Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:58.0608321Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:58.0608432Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:58.0608662Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:58.0608761Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:58.0608880Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:58.0608980Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0609159Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:58.0609257Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:58.0609357Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:58.0609453Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:58.0609540Z #define __FXSR__ 1 2025-05-07T20:26:58.0609622Z #define _SIZE_T 2025-05-07T20:26:58.0609726Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:58.0609844Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:58.0610013Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.0610163Z 
#define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:58.0610264Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:58.0610366Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:58.0610553Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.0610755Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:58.0610849Z #define _GXX_NULLPTR_T 2025-05-07T20:26:58.0610982Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:58.0611072Z #define FOPEN_MAX 16 2025-05-07T20:26:58.0611163Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:58.0611288Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.0611386Z #define __suseconds_t_defined 2025-05-07T20:26:58.0611473Z #define __off_t_defined 2025-05-07T20:26:58.0611564Z #define stderr stderr 2025-05-07T20:26:58.0611661Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:58.0611773Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:58.0611879Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:58.0611972Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:58.0612375Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:58.0612469Z #define __mode_t_defined 2025-05-07T20:26:58.0612551Z #define _GCC_SIZE_T 2025-05-07T20:26:58.0612737Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0612848Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:58.0612958Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:58.0613112Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:58.0613211Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:58.0613319Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:58.0613434Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:58.0613549Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:58.0613648Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:58.0613734Z #define __size_t__ 2025-05-07T20:26:58.0613869Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.0613971Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:58.0614148Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:58.0614443Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:58.0614590Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:58.0614800Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:58.0619410Z #define _ENDIAN_H 1 2025-05-07T20:26:58.0619544Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:58.0619645Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:58.0619751Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:58.0619831Z #define __try try 2025-05-07T20:26:58.0619933Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:58.0620025Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:58.0620112Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:58.0620375Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:58.0620462Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:58.0620540Z #define __PIC__ 2 2025-05-07T20:26:58.0620656Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:58.0620773Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:58.0620900Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:58.0620997Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:58.0621095Z #define _GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:58.0621421Z #define __FLT32X_NORM_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.0621519Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0621618Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:58.0621711Z #define _IO_uid_t __uid_t 2025-05-07T20:26:58.0621807Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:58.0621932Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:58.0622027Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:58.0622174Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.0622274Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:58.0622395Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:58.0622475Z #define LONG_BIT 64 2025-05-07T20:26:58.0622582Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:58.0622679Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:58.0622808Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:58.0622913Z #define __fsfilcnt_t_defined 2025-05-07T20:26:58.0623008Z #define __blkcnt_t_defined 2025-05-07T20:26:58.0623344Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.0623478Z #define __USE_LARGEFILE 1 2025-05-07T20:26:58.0623607Z #define __cpp_constexpr 201603L 2025-05-07T20:26:58.0623700Z #define CUDART_VERSION 12060 2025-05-07T20:26:58.0623791Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:58.0623888Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:58.0623976Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:58.0624171Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:58.0624260Z #define __lldiv_t_defined 1 2025-05-07T20:26:58.0624343Z #define __SSE2__ 1 2025-05-07T20:26:58.0624422Z #define _IOLBF 1 2025-05-07T20:26:58.0624519Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:58.0624615Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:58.0624718Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:58.0625425Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:58.0625550Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:58.0625641Z #define __INT32_TYPE__ int 2025-05-07T20:26:58.0625753Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:58.0625875Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:58.0625989Z #define __cpp_exceptions 199711L 2025-05-07T20:26:58.0626086Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.0626194Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:58.0626286Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:58.0626410Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:58.0626568Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:58.0626664Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:58.0626757Z #define __SWORD_TYPE long int 2025-05-07T20:26:58.0626850Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:58.0626955Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:58.0627046Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:58.0627145Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:58.0627434Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.0627526Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:58.0627676Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:58.0627752Z #define _T_SIZE 2025-05-07T20:26:58.0627857Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:58.0627984Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.0628106Z #define __va_arg_pack() __builtin_va_arg_pack () 
2025-05-07T20:26:58.0628195Z [... preprocessor macro dump elided: several thousand `#define` lines printed by the toolchain check, covering glibc and libstdc++ configuration macros (e.g. _GLIBCXX_HAVE_TLS 1, _GNU_SOURCE 1), C++ feature-test macros (e.g. __cpp_generic_lambdas 201304L), and the CUDA compiler's own identifiers (e.g. __NVCC__ 1, __CUDACC__ 1) ...]
2025-05-07T20:26:58.0646222Z 
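For reference, a macro dump like the one elided above can be regenerated by asking the toolchain to print every macro it defines; a minimal sketch, assuming the same build_binary environment and an illustrative probe.cu file (the exact flags the setup script used are not shown in this excerpt):

    # Create a trivial CUDA translation unit to preprocess (probe.cu is hypothetical)
    echo '// empty probe' > probe.cu
    # Preprocess only (-E) and have the host compiler list all defined macros (-dM)
    conda run -n build_binary nvcc -E -Xcompiler -dM probe.cu | sort
    # The host compiler's baseline view, without nvcc in front:
    conda run -n build_binary g++ -E -dM -x c++ /dev/null | sort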
2025-05-07T20:26:58.0696743Z 2025-05-07T20:26:58.0697122Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:58.0697134Z 2025-05-07T20:26:59.9589067Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:59.9589457Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:59.9589771Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:59.9590090Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:59.9590408Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:59.9590620Z 2025-05-07T20:27:00.0226538Z 2025-05-07T20:27:00.0236872Z /usr/bin/nvidia-smi 2025-05-07T20:27:00.0242142Z + nvidia-smi 2025-05-07T20:27:00.0242387Z 2025-05-07T20:27:00.0415805Z Wed May 7 20:27:00 2025 2025-05-07T20:27:00.0416179Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.0416687Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:00.0417224Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.0417729Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:00.0418246Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:00.0418667Z | | | MIG M. | 2025-05-07T20:27:00.0418998Z |=========================================+========================+======================| 2025-05-07T20:27:00.0587333Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:00.0587779Z | 0% 26C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:00.0588149Z | | | N/A | 2025-05-07T20:27:00.0588535Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.0592182Z 2025-05-07T20:27:00.0592810Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.0593241Z | Processes: | 2025-05-07T20:27:00.0593678Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:00.0594093Z | ID ID Usage | 2025-05-07T20:27:00.0594426Z |=========================================================================================| 2025-05-07T20:27:00.0598521Z | No running processes found | 2025-05-07T20:27:00.0598990Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.3120509Z 2025-05-07T20:27:00.3125709Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:00.3178846Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.3179421Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.3191350Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:00.3191694Z env: 2025-05-07T20:27:00.3191925Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:00.3192211Z BUILD_ENV: build_binary 2025-05-07T20:27:00.3192465Z BUILD_TARGET: genai 2025-05-07T20:27:00.3192694Z BUILD_VARIANT: cuda 2025-05-07T20:27:00.3192931Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:00.3193183Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:00.3193487Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:00.3193816Z ##[endgroup] 2025-05-07T20:27:00.6540813Z ################################################################################ 2025-05-07T20:27:00.6541192Z # Install PyTorch (PIP) 2025-05-07T20:27:00.6541430Z # 2025-05-07T20:27:00.6557560Z # [2025-05-07T20:27:00.655Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:00.6558029Z ################################################################################ 2025-05-07T20:27:00.6558247Z 2025-05-07T20:27:00.6586842Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:01.6505978Z Channels: 2025-05-07T20:27:01.6506229Z - conda-forge 2025-05-07T20:27:01.6506452Z Platform: linux-64 2025-05-07T20:27:05.1313655Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:05.8633797Z Solving environment: \ | / done 2025-05-07T20:27:06.0844530Z 2025-05-07T20:27:06.0844801Z ## Package Plan ## 2025-05-07T20:27:06.0845072Z 2025-05-07T20:27:06.0845449Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:06.0846206Z 2025-05-07T20:27:06.0846459Z added / updated specs: 2025-05-07T20:27:06.0847076Z - numpy 2025-05-07T20:27:06.0847358Z 2025-05-07T20:27:06.0847396Z 2025-05-07T20:27:06.0847690Z The following packages will be downloaded: 2025-05-07T20:27:06.0848212Z 2025-05-07T20:27:06.0848445Z package | build 2025-05-07T20:27:06.0849043Z ---------------------------|----------------- 2025-05-07T20:27:06.0849515Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:06.0849973Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:06.0850426Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:06.0850887Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:06.0851353Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:06.0851832Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:06.0852286Z numpy-2.2.5 | py312h72c5963_0 8.1 MB conda-forge 2025-05-07T20:27:06.0852683Z ------------------------------------------------------------ 2025-05-07T20:27:06.0853350Z Total: 15.4 MB 2025-05-07T20:27:06.0853564Z 2025-05-07T20:27:06.0853695Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:06.0853925Z 2025-05-07T20:27:06.0854139Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:06.0854647Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:06.0855164Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:06.0855670Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:06.0856193Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:06.0856741Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:06.0857471Z numpy 
conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:06.0857913Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:06.9813875Z [... interleaved conda download progress bars elided: libblas, libcblas, liblapack, libgfortran, libgfortran5, libopenblas, and numpy each downloaded to 100% ...]
2025-05-07T20:27:07.0816244Z Preparing transaction: done
2025-05-07T20:27:07.2824995Z Verifying transaction: done
2025-05-07T20:27:07.3833985Z Executing transaction: done
2025-05-07T20:27:07.5598670Z ################################################################################
2025-05-07T20:27:07.5599412Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:07.5599845Z #
2025-05-07T20:27:07.5617503Z # [2025-05-07T20:27:07.561Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:07.5618192Z ################################################################################
2025-05-07T20:27:07.5634938Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:07.6522114Z [CHECK] Network does not appear to be blocked.
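The [EXEC] [ATTEMPT 0/3] prefix above (and on later steps) comes from a retry wrapper in the prelude script. A minimal sketch of that pattern, with illustrative names rather than the real setup_env.bash helper:

    # Run a command up to max_retries+1 times, pausing briefly between attempts
    exec_with_retries () {
      local max_retries="$1"; shift
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2
      done
      return 1
    }
    # Usage mirroring the network probe above:
    exec_with_retries 3 wget -q --timeout 1 pypi.org -O /dev/null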
2025-05-07T20:27:07.6522486Z ################################################################################ 2025-05-07T20:27:07.6522823Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:07.6523120Z # 2025-05-07T20:27:07.6541003Z # [2025-05-07T20:27:07.653Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:07.6541673Z ################################################################################ 2025-05-07T20:27:07.6564767Z 2025-05-07T20:27:07.6565024Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:07.6590912Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:07.6607075Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:07.6607624Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:07.6614742Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:07.6622895Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:07.6643454Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:30.4022457Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:30.4022948Z Collecting torch 2025-05-07T20:28:30.4023640Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:30.4024362Z Collecting filelock (from torch) 2025-05-07T20:28:30.4024869Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:30.4025910Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2) 2025-05-07T20:28:30.4026966Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1) 2025-05-07T20:28:30.4027627Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:30.4028156Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:30.4028982Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 36.1 MB/s eta 0:00:00 2025-05-07T20:28:30.4029341Z Collecting networkx (from torch) 2025-05-07T20:28:30.4029848Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:30.4030504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 16.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4030847Z Collecting jinja2 (from torch) 2025-05-07T20:28:30.4031327Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:30.4031834Z Collecting fsspec (from torch) 2025-05-07T20:28:30.4032322Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:30.4032893Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4033601Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:30.4034382Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 62.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4034804Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4035522Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:30.4037097Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 11.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4037504Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:30.4038395Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:30.4039178Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 38.8 MB/s eta 0:00:00 2025-05-07T20:28:30.4039558Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:30.4040227Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:30.4040977Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 33.3 MB/s eta 0:00:00 2025-05-07T20:28:30.4041612Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:30.4042388Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:30.4043229Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 59.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4043609Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:30.4044272Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:30.4045025Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 100.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4045404Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:30.4046073Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:30.4046848Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 145.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4047251Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:30.4047934Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:30.4048702Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 132.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4049095Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:30.4049778Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:30.4050547Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 102.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4050935Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:30.4051676Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:30.4052447Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 112.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4052830Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:30.4053717Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:30.4054472Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4055112Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 
2025-05-07T20:28:30.4055771Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:30.4056533Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:30.4057379Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 192.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4057767Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:30.4058540Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:30.4059736Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:30.4060551Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:30.4061365Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:30.4061915Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:30.4062557Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 46.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4062922Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:30.4063775Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:28:30.4064820Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:30.4065624Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 37.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4066370Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:30.4067202Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 55.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4067941Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:30.4068766Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 125.1 MB/s eta 0:00:00 2025-05-07T20:28:30.4069553Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:30.4070437Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 133.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4072191Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:30.4074105Z 2025-05-07T20:28:30.4076047Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 
nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:30.4078198Z 2025-05-07T20:28:32.6246516Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:32.6248869Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:36.1134555Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:39.6444422Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:39.6445089Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:43.0632400Z True 2025-05-07T20:28:43.0632694Z True 2025-05-07T20:28:43.0632801Z 2025-05-07T20:28:43.1255121Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:43.1292926Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:43.1293628Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:43.1305619Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:43.1305969Z env: 2025-05-07T20:28:43.1306200Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:43.1306496Z BUILD_ENV: build_binary 2025-05-07T20:28:43.1306744Z BUILD_TARGET: genai 2025-05-07T20:28:43.1306973Z BUILD_VARIANT: cuda 2025-05-07T20:28:43.1307215Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:43.1307463Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:43.1307762Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:43.1308089Z ##[endgroup] 2025-05-07T20:28:43.4653619Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:43.4655475Z ################################################################################ 2025-05-07T20:28:43.4655970Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:43.4656331Z # 2025-05-07T20:28:43.4671647Z # [2025-05-07T20:28:43.466Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:43.4672062Z ################################################################################ 2025-05-07T20:28:43.4672277Z 2025-05-07T20:28:43.4687042Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:43.5579857Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:43.5589977Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:43.5590594Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:43.5590984Z 2025-05-07T20:28:43.6499797Z 2025-05-07T20:28:43.6500382Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:43.6523175Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:49.5930679Z Collecting environment information... 
2025-05-07T20:28:49.5931037Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:49.5931323Z Is debug build: False 2025-05-07T20:28:49.5931586Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:49.5931871Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:49.5932044Z 2025-05-07T20:28:49.5932148Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:49.5932468Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:49.5932784Z Clang version: Could not collect 2025-05-07T20:28:49.5933138Z CMake version: Could not collect 2025-05-07T20:28:49.5933417Z Libc version: glibc-2.34 2025-05-07T20:28:49.5933576Z 2025-05-07T20:28:49.5933879Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:49.5934484Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:49.5934888Z Is CUDA available: True 2025-05-07T20:28:49.5935145Z CUDA runtime version: 12.6.85 2025-05-07T20:28:49.5935421Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:49.5935727Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:49.5936327Z Nvidia driver version: 570.133.07 2025-05-07T20:28:49.5936606Z cuDNN version: Could not collect 2025-05-07T20:28:49.5936881Z HIP runtime version: N/A 2025-05-07T20:28:49.5937122Z MIOpen runtime version: N/A 2025-05-07T20:28:49.5937375Z Is XNNPACK available: True 2025-05-07T20:28:49.5937539Z 2025-05-07T20:28:49.5937619Z CPU: 2025-05-07T20:28:49.5937835Z Architecture: x86_64 2025-05-07T20:28:49.5938156Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:49.5938567Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:49.5938970Z Byte Order: Little Endian 2025-05-07T20:28:49.5939275Z CPU(s): 16 2025-05-07T20:28:49.5939566Z On-line CPU(s) list: 0-15 2025-05-07T20:28:49.5940084Z Vendor ID: AuthenticAMD 2025-05-07T20:28:49.5940418Z Model name: AMD EPYC 7R32 2025-05-07T20:28:49.5940732Z CPU family: 23 2025-05-07T20:28:49.5941024Z Model: 49 2025-05-07T20:28:49.5941300Z Thread(s) per core: 2 2025-05-07T20:28:49.5941585Z Core(s) per socket: 8 2025-05-07T20:28:49.5941864Z Socket(s): 1 2025-05-07T20:28:49.5942136Z Stepping: 0 2025-05-07T20:28:49.5942421Z BogoMIPS: 5599.99 2025-05-07T20:28:49.5944443Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:49.5946451Z Hypervisor vendor: KVM 2025-05-07T20:28:49.5946757Z Virtualization type: full 2025-05-07T20:28:49.5947089Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:49.5947442Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:49.5947801Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:49.5948153Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:49.5948470Z NUMA node(s): 1 2025-05-07T20:28:49.5948752Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:49.5949081Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:49.5949447Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:49.5949798Z Vulnerability L1tf: Not affected 2025-05-07T20:28:49.5950143Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:49.5950494Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:49.5950837Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:49.5951199Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:49.5951731Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:49.5952299Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:49.5952823Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:49.5953497Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:49.5954335Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:49.5954990Z Vulnerability Srbds: Not affected 2025-05-07T20:28:49.5955429Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:49.5955660Z 2025-05-07T20:28:49.5955762Z Versions of relevant libraries: 2025-05-07T20:28:49.5956024Z [pip3] numpy==2.2.5 2025-05-07T20:28:49.5956259Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:49.5956557Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:49.5956865Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:49.5957168Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:49.5957480Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:49.5957762Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:49.5958043Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:49.5958342Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:49.5958693Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:49.5959108Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:49.5960097Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:49.5960409Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:49.5960731Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:49.5961034Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:49.5961328Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:49.5961699Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5962177Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5962738Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5963490Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5964070Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5964703Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5965327Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5965909Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:49.5966444Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:49.5967076Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:49.5967659Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5968189Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:49.5968803Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5969393Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5979032Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5979520Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:49.5979978Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:49.5980437Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:49.5980890Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5981330Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:49.5981780Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5982236Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5982691Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:49.5983162Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:49.5983643Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5984112Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:49.5984576Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5985241Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:49.5985693Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:28:49.5986139Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:49.5986626Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:49.5987111Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5987600Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5988070Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:49.5988655Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:49.5989121Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:49.5989602Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:49.5990074Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:49.5990556Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:49.5991030Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:49.5991493Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:49.5991958Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5992423Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:49.5992875Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:49.5993141Z 2025-05-07T20:28:49.6695599Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.6696265Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.6708241Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:49.6708599Z env: 2025-05-07T20:28:49.6708837Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:49.6709136Z BUILD_ENV: build_binary 2025-05-07T20:28:49.6709456Z BUILD_TARGET: genai 2025-05-07T20:28:49.6709724Z BUILD_VARIANT: cuda 2025-05-07T20:28:49.6709957Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:49.6710229Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:49.6710532Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:49.6710862Z ##[endgroup] 2025-05-07T20:28:50.0099869Z ################################################################################ 2025-05-07T20:28:50.0100265Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:50.0100535Z # 2025-05-07T20:28:50.0115545Z # [2025-05-07T20:28:50.011Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:50.0115947Z ################################################################################ 2025-05-07T20:28:50.0116185Z 2025-05-07T20:28:50.0130947Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:50.1001901Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:50.1023193Z [BUILD] Running git submodules update ... 2025-05-07T20:28:50.1044687Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:50.1410036Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:50.1410492Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:50.1410940Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:50.1411333Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:50.1411728Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:50.1412177Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:50.1412578Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:50.1445853Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:50.1999505Z [BUILD] Installing other build dependencies ... 
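Condensed, the prepare step above reduces to three commands (each wrapped in the retry helper sketched earlier); the equivalent manual sequence, assuming a checkout with submodules already configured:

    # Refresh submodule URLs and fetch the pinned commits (asmjit, cutlass, cpuinfo, ...)
    git submodule sync
    git submodule update --init --recursive
    # Install the FBGEMM-GPU build dependencies into the conda environment
    conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt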
2025-05-07T20:28:50.2021996Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:52.6036838Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:52.6220076Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:52.7348254Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:52.7379319Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:52.9610026Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:52.9651609Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:53.0824200Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:53.0869916Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:53.4226693Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:53.4289568Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:53.4838071Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:53.4842920Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:53.5632619Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:53.5663920Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:53.6176685Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:53.6671677Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:53.6722710Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:53.8045790Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:53.8076852Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:53.9218818Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:53.9268298Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:53.9926338Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:54.0584104Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:54.0627844Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:54.1770280Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:54.1799648Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:54.2985285Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:54.3033636Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.4296523Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.4325651Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.5364572Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:54.5396165Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.6478347Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.6512470Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7506685Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.7549399Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7994984Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:54.8456971Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.8486364Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:54.8938516Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:54.9461991Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:54.9490801Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:55.0003499Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:55.0699000Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.0729903Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.1287298Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.1873138Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.2451619Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.7922777Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 51.0 MB/s eta 0:00:00 2025-05-07T20:28:55.7956627Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.8521564Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.9151074Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.9694718Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:56.0314069Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:56.0882402Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:28:56.1532864Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.0 MB/s eta 0:00:00 2025-05-07T20:28:56.1582172Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:56.2199571Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.2814714Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:56.3402602Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:56.4071146Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:56.4615257Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:56.5188663Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:56.5822125Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.6343077Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:56.6946226Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:56.8635902Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:59.2492447Z 2025-05-07T20:28:59.2537842Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:59.4220686Z ################################################################################ 2025-05-07T20:28:59.4221089Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:59.4221359Z # 2025-05-07T20:28:59.4239226Z # [2025-05-07T20:28:59.423Z] + install_triton_pip build_binary 2025-05-07T20:28:59.4239617Z ################################################################################ 2025-05-07T20:28:59.4239837Z 2025-05-07T20:28:59.4240058Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:59.4240491Z ################################################################################ 2025-05-07T20:28:59.4240838Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:59.4241157Z # 2025-05-07T20:28:59.4257408Z # [2025-05-07T20:28:59.425Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4258043Z ################################################################################ 2025-05-07T20:28:59.4258267Z 2025-05-07T20:28:59.4273675Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:59.5170868Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:59.5171552Z ################################################################################ 2025-05-07T20:28:59.5172196Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:59.5172513Z # 2025-05-07T20:28:59.5188655Z # [2025-05-07T20:28:59.518Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.5189132Z ################################################################################ 2025-05-07T20:28:59.5189341Z 2025-05-07T20:28:59.5236362Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:59.5254032Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:59.5254552Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:59.5261791Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:59.5271311Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:59.5292887Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:07.4012483Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:07.4013900Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:07.4014604Z 2025-05-07T20:29:07.4014818Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:07.4015218Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:07.4016014Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:07.4017201Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:07.4018260Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 52.6 MB/s eta 0:00:00 2025-05-07T20:29:07.4018634Z Installing collected packages: pytorch-triton 2025-05-07T20:29:07.4018968Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:07.4019351Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:07.4019765Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:07.4020189Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:07.4020624Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:07.4021200Z 2025-05-07T20:29:09.6245839Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:09.6249830Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:11.7698733Z ################################################################################ 2025-05-07T20:29:11.7699195Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:11.7699590Z ################################################################################ 2025-05-07T20:29:11.7699816Z 2025-05-07T20:29:13.8123993Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:15.9878553Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:15.9882080Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:15.9915097Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:15.9915592Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:15.9927205Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:15.9927567Z env: 2025-05-07T20:29:15.9927799Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:15.9928091Z BUILD_ENV: build_binary 2025-05-07T20:29:15.9928343Z BUILD_TARGET: genai 2025-05-07T20:29:15.9928577Z BUILD_VARIANT: cuda 2025-05-07T20:29:15.9928816Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:15.9929070Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:15.9929368Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:15.9929770Z ##[endgroup] 2025-05-07T20:29:16.3295744Z ################################################################################ 2025-05-07T20:29:16.3296136Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:16.3296401Z # 2025-05-07T20:29:16.3312502Z # [2025-05-07T20:29:16.330Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3313146Z ################################################################################ 2025-05-07T20:29:16.3313364Z 2025-05-07T20:29:16.3313735Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3314422Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3314762Z 2025-05-07T20:29:16.3432667Z 839b6c4a76b132decd86ba2192408e2709e83cea fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3435357Z 2025-05-07T20:29:16.3435748Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3569602Z 2025-05-07T20:29:16.3570402Z 1b0d0e6113168fc8d58f5641aa11b1400e22aeae573cc3e05b442ee4be9a1e2d fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3573052Z 2025-05-07T20:29:16.3582387Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3582897Z 2025-05-07T20:29:16.3805145Z 54d55da1a6aeedb5d1904417fe635ccb fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3807340Z 2025-05-07T20:29:16.3817312Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:16.3839611Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:19.0527472Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:19.0528396Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:19.0529230Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:19.0529669Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:19.0529932Z 2025-05-07T20:29:26.0228485Z ################################################################################ 2025-05-07T20:29:26.0229284Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:26.0229663Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:26.0230080Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:26.0230398Z [CHECK] 2025-05-07T20:29:26.0230720Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:26.0231208Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:26.0231603Z ################################################################################ 2025-05-07T20:29:26.0231820Z 2025-05-07T20:29:26.0231945Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:30.0443915Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:34.0495168Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:38.0567072Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:38.0570306Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:50.0720150Z ################################################################################ 2025-05-07T20:29:50.0720698Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:50.0721165Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:50.0721517Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:50.0721858Z ################################################################################ 2025-05-07T20:29:50.0722073Z 2025-05-07T20:29:58.0875223Z ################################################################################ 2025-05-07T20:29:58.0875669Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:58.0877046Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:58.0878590Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:58.0879106Z ################################################################################ 2025-05-07T20:29:58.0879331Z 2025-05-07T20:29:58.0879488Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:02.0989163Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:06.1091444Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:10.2298220Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:14.2328479Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:14.2332298Z [INSTALL] Check for operator registrations ... 
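[NOTE] The registration probe below resolves each operator on the torch.ops.fbgemm namespace; a minimal sketch of such a check (an illustration, not the harness's actual code) is:

    import torch
    import fbgemm_gpu  # noqa: F401  # importing loads the shared libraries that register the ops

    def op_registered(name: str) -> bool:
        # torch.ops.fbgemm.<name> raises AttributeError when no loaded
        # library has registered a schema under that name.
        try:
            getattr(torch.ops.fbgemm, name)
            return True
        except AttributeError:
            return False

    for op in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        print(op, "registered:", op_registered(op))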
2025-05-07T20:30:18.1485730Z fbgemm.nccl_init 2025-05-07T20:30:18.1487788Z 2025-05-07T20:30:18.2113438Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:22.1281832Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:22.1282221Z 2025-05-07T20:30:22.1900007Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:26.1196413Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.1196635Z 2025-05-07T20:30:26.1815493Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.1816135Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:26.1851499Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.1851962Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.1866572Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:26.1866934Z env: 2025-05-07T20:30:26.1867176Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:26.1867476Z BUILD_ENV: build_binary 2025-05-07T20:30:26.1867927Z BUILD_TARGET: genai 2025-05-07T20:30:26.1868164Z BUILD_VARIANT: cuda 2025-05-07T20:30:26.1868401Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:26.1868669Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:26.1868979Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:26.1869313Z ##[endgroup] 2025-05-07T20:30:26.5231475Z ################################################################################ 2025-05-07T20:30:26.5231872Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:26.5232132Z # 2025-05-07T20:30:26.5246732Z # [2025-05-07T20:30:26.524Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:26.5247138Z ################################################################################ 2025-05-07T20:30:26.5247359Z 2025-05-07T20:30:34.5001428Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:34.5002003Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:34.5002397Z [TEST] Determined the test directories: 2025-05-07T20:30:34.5002733Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:34.5003033Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:34.5003335Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:34.5003519Z 2025-05-07T20:30:34.5012372Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:34.5019215Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:34.5019653Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:34.5019928Z 2025-05-07T20:30:34.9245025Z 2025-05-07T20:30:34.9245219Z [TEST] Installing PyTest ... 
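[NOTE] The [EXEC] [ATTEMPT 0/3] prefix below comes from a retry wrapper around network-dependent commands; a rough Python equivalent (an assumed shape, not the harness's actual implementation) is:

    import subprocess
    import time

    def exec_with_retries(cmd: list[str], max_attempts: int = 3) -> None:
        # Re-run flaky commands (e.g. package downloads) with exponential backoff.
        for attempt in range(max_attempts):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(2 ** attempt)
        raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")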
2025-05-07T20:30:34.9269753Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:36.0295302Z Channels: 2025-05-07T20:30:36.0295754Z - conda-forge 2025-05-07T20:30:36.0296202Z Platform: linux-64 2025-05-07T20:30:39.3250143Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:40.4662025Z Solving environment: \ | / done 2025-05-07T20:30:40.6909468Z 2025-05-07T20:30:40.6910276Z ## Package Plan ## 2025-05-07T20:30:40.6910517Z 2025-05-07T20:30:40.6912930Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:40.6913426Z 2025-05-07T20:30:40.6913585Z added / updated specs: 2025-05-07T20:30:40.6914117Z - expecttest 2025-05-07T20:30:40.6914477Z - pytest 2025-05-07T20:30:40.6914632Z 2025-05-07T20:30:40.6914636Z 2025-05-07T20:30:40.6914822Z The following packages will be downloaded: 2025-05-07T20:30:40.6915079Z 2025-05-07T20:30:40.6915284Z package | build 2025-05-07T20:30:40.6915738Z ---------------------------|----------------- 2025-05-07T20:30:40.6916210Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:40.6916766Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:40.6917436Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:40.6917977Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:40.6918573Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:40.6919049Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:40.6919549Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:40.6920465Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:40.6921085Z ------------------------------------------------------------ 2025-05-07T20:30:40.6921515Z Total: 428 KB 2025-05-07T20:30:40.6921868Z 2025-05-07T20:30:40.6922051Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:40.6922299Z 2025-05-07T20:30:40.6922565Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:40.6923348Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:40.6924115Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:40.6924772Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:40.6925333Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:40.6925947Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:40.6926440Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:40.6926961Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:40.6927300Z 2025-05-07T20:30:40.6927355Z 2025-05-07T20:30:40.6927360Z 2025-05-07T20:30:40.6927539Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:40.6928028Z [conda download progress bars elided: pytest, packaging, colorama, pluggy, exceptiongroup, tomli, expecttest, and iniconfig all reached 100%] done 2025-05-07T20:30:41.2849042Z Preparing transaction: done 2025-05-07T20:30:41.3853434Z Verifying transaction: done 2025-05-07T20:30:43.2885456Z Executing transaction: done 2025-05-07T20:30:43.4144684Z [TEST] Checking imports ... 2025-05-07T20:30:47.3962383Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:47.3974632Z [TEST] Setting feature flags ...
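[NOTE] The feature flag is pinned on the conda environment (next command) so that every later `conda run` in the job sees it; inside the tests the gate is presumably just an environment-variable read, e.g.:

    import os

    # Assumed gating pattern for the flag set below; illustrative only.
    if os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD", "0") == "1":
        print("ensemble rowwise Adagrad code paths enabled for this run")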
2025-05-07T20:30:47.3975300Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:47.3975818Z 2025-05-07T20:30:47.8233474Z 2025-05-07T20:30:47.8234290Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:47.8235082Z ################################################################################ 2025-05-07T20:30:47.8235639Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:47.8236053Z # 2025-05-07T20:30:47.8254456Z # [2025-05-07T20:30:47.825Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:47.8255071Z ################################################################################ 2025-05-07T20:30:47.8255316Z 2025-05-07T20:30:47.8262974Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:47.8292177Z ./attention/gqa_test.py 2025-05-07T20:30:47.8292706Z ./coalesce/coalesce_test.py 2025-05-07T20:30:47.8293568Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:47.8294047Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:47.8294636Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:47.8295111Z ./moe/activation_test.py 2025-05-07T20:30:47.8295497Z ./moe/gather_scatter_test.py 2025-05-07T20:30:47.8296029Z ./moe/layers_test.py 2025-05-07T20:30:47.8296472Z ./moe/shuffling_test.py 2025-05-07T20:30:47.8296979Z ./quantize/quantize_test.py 2025-05-07T20:30:47.8297247Z 2025-05-07T20:30:47.8297432Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:47.8297810Z 2025-05-07T20:30:47.8312925Z ################################################################################ 2025-05-07T20:30:47.8327673Z # [2025-05-07T20:30:47.832Z] Run Python Test Suite: 2025-05-07T20:30:47.8328123Z # ./attention/gqa_test.py 2025-05-07T20:30:47.8328498Z ################################################################################ 2025-05-07T20:30:47.8352441Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:47.8353120Z 2025-05-07T20:30:50.3666744Z ============================= test session starts ============================== 2025-05-07T20:30:50.3667880Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:50.3668490Z cachedir: .pytest_cache 2025-05-07T20:30:50.3669417Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:50.3670297Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:50.3670810Z plugins: hypothesis-6.131.14 2025-05-07T20:30:52.0469357Z collecting ... 
collected 2 items 2025-05-07T20:30:52.0469721Z 2025-05-07T20:31:30.4408592Z attention/gqa_test.py::Int4GQATest::test_gqa [Hypothesis example dumps elided: the derandomized 'ci' profile tried several dozen combinations of int4_kv, num_groups (1 or 4), and B, MAX_T, N_H_L values up to ~126] PASSED 2025-05-07T20:31:30.4609015Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
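[NOTE] The session header above reports hypothesis profile 'ci' (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). A profile with those settings would be registered like this (a sketch based only on the reported values):

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,       # no example database carried between CI runs
        deadline=None,       # first-run kernel compilation can be slow
        print_blob=True,     # print a reproduction blob on failure
        derandomize=True,    # deterministic example order across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")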
2025-05-07T20:31:30.4609342Z 2025-05-07T20:31:30.4609494Z =========================== short test summary info ============================ 2025-05-07T20:31:30.4610198Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:30.4611055Z ======================== 1 passed, 1 skipped in 40.60s ========================= 2025-05-07T20:31:31.1118336Z 2025-05-07T20:31:31.1118919Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:31.1138933Z [TEST] Python test time for ./attention/gqa_test.py: 44 seconds 2025-05-07T20:31:31.1139224Z 2025-05-07T20:31:31.1139228Z 2025-05-07T20:31:31.1139232Z 2025-05-07T20:31:31.1139236Z 2025-05-07T20:31:31.1159850Z ################################################################################ 2025-05-07T20:31:31.1175239Z # [2025-05-07T20:31:31.117Z] Run Python Test Suite: 2025-05-07T20:31:31.1175632Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:31.1176011Z ################################################################################ 2025-05-07T20:31:31.1201116Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:31.1201735Z 2025-05-07T20:31:33.2781041Z ============================= test session starts ============================== 2025-05-07T20:31:33.2781678Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.2782205Z cachedir: .pytest_cache 2025-05-07T20:31:33.2782780Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.2783494Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.2783911Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.0191018Z collecting ... 
collected 1 item 2025-05-07T20:31:35.0191426Z 2025-05-07T20:31:35.7737963Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:35.7738287Z 2025-05-07T20:31:35.7738503Z ============================== 1 passed in 2.62s =============================== 2025-05-07T20:31:36.4056382Z 2025-05-07T20:31:36.4057024Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:36.4076093Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:36.4076504Z 2025-05-07T20:31:36.4076509Z 2025-05-07T20:31:36.4076513Z 2025-05-07T20:31:36.4076517Z 2025-05-07T20:31:36.4098684Z ################################################################################ 2025-05-07T20:31:36.4113840Z # [2025-05-07T20:31:36.411Z] Run Python Test Suite: 2025-05-07T20:31:36.4114304Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.4114625Z ################################################################################ 2025-05-07T20:31:36.4138561Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.4139179Z 2025-05-07T20:31:38.5711203Z ============================= test session starts ============================== 2025-05-07T20:31:38.5711997Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:38.5712550Z cachedir: .pytest_cache 2025-05-07T20:31:38.5713127Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:38.5713843Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:38.5714250Z plugins: hypothesis-6.131.14 2025-05-07T20:31:40.2748666Z collecting ... 
collected 5 items 2025-05-07T20:31:40.2749264Z 2025-05-07T20:31:40.2761656Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:40.2770979Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:40.2779235Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:40.2787312Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:40.2806542Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:40.2807018Z 2025-05-07T20:31:40.2807516Z =========================== short test summary info ============================ 2025-05-07T20:31:40.2808276Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2809194Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2810269Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2811181Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2812087Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2812733Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:31:40.8505891Z 2025-05-07T20:31:40.8506646Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:40.8526112Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:40.8526522Z 2025-05-07T20:31:40.8526529Z 2025-05-07T20:31:40.8526534Z 2025-05-07T20:31:40.8526539Z 2025-05-07T20:31:40.8548805Z ################################################################################ 2025-05-07T20:31:40.8564786Z # [2025-05-07T20:31:40.856Z] Run Python Test Suite: 2025-05-07T20:31:40.8565281Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:40.8565597Z ################################################################################ 2025-05-07T20:31:40.8589644Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:40.8590313Z 2025-05-07T20:31:43.0100281Z ============================= test session starts ============================== 2025-05-07T20:31:43.0101084Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:43.0101604Z cachedir: .pytest_cache 2025-05-07T20:31:43.0102169Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:43.0102942Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:43.0103352Z plugins: hypothesis-6.131.14 2025-05-07T20:31:44.8020277Z collecting ... 
collected 2 items 2025-05-07T20:31:44.8020859Z 2025-05-07T20:31:44.8031178Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:44.8047449Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:44.8048039Z 2025-05-07T20:31:44.8048281Z =========================== short test summary info ============================ 2025-05-07T20:31:44.8048921Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:44.8049739Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:44.8058011Z ============================== 2 skipped in 1.92s ============================== 2025-05-07T20:31:45.3874004Z 2025-05-07T20:31:45.3874804Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:45.3895297Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:45.3895623Z 2025-05-07T20:31:45.3895627Z 2025-05-07T20:31:45.3895642Z 2025-05-07T20:31:45.3895646Z 2025-05-07T20:31:45.3916222Z ################################################################################ 2025-05-07T20:31:45.3931922Z # [2025-05-07T20:31:45.392Z] Run Python Test Suite: 2025-05-07T20:31:45.3932393Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.3932758Z ################################################################################ 2025-05-07T20:31:45.3957943Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.3958841Z 2025-05-07T20:31:47.5469987Z ============================= test session starts ============================== 2025-05-07T20:31:47.5470754Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:47.5471281Z cachedir: .pytest_cache 2025-05-07T20:31:47.5471856Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:47.5472562Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:47.5473002Z plugins: hypothesis-6.131.14 2025-05-07T20:31:49.2392758Z collecting ... collected 4 items 2025-05-07T20:31:49.2392962Z 2025-05-07T20:31:52.0040422Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:52.0125323Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:52.0223335Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:52.0314217Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:52.0314706Z 2025-05-07T20:31:52.0314919Z =========================== short test summary info ============================ 2025-05-07T20:31:52.0315856Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:52.0316938Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:52.0317541Z ============================== 4 skipped in 4.61s ============================== 2025-05-07T20:31:53.8937269Z 2025-05-07T20:31:53.8937866Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:53.8957358Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:53.8957742Z 2025-05-07T20:31:53.8957814Z 2025-05-07T20:31:53.8957901Z 2025-05-07T20:31:53.8957907Z 2025-05-07T20:31:53.8978511Z ################################################################################ 2025-05-07T20:31:53.8993563Z # [2025-05-07T20:31:53.899Z] Run Python Test Suite: 2025-05-07T20:31:53.8994021Z # ./moe/activation_test.py 2025-05-07T20:31:53.8994393Z ################################################################################ 2025-05-07T20:31:53.9020174Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:53.9020841Z 2025-05-07T20:31:56.0571068Z ============================= test session starts ============================== 2025-05-07T20:31:56.0571707Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:56.0572225Z cachedir: .pytest_cache 2025-05-07T20:31:56.0572803Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:56.0573622Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:56.0574034Z plugins: hypothesis-6.131.14 2025-05-07T20:31:57.7163007Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:57.8271955Z collecting ... 
collected 2 items 2025-05-07T20:31:57.8272153Z 2025-05-07T20:32:03.1626261Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:03.1627455Z self=, 2025-05-07T20:32:03.1627851Z T=1, 2025-05-07T20:32:03.1628050Z D=5120, 2025-05-07T20:32:03.1628248Z contiguous=True, 2025-05-07T20:32:03.1628486Z compiled=True, 2025-05-07T20:32:03.1628706Z ) 2025-05-07T20:32:03.1628910Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1629294Z self=, 2025-05-07T20:32:03.1629862Z T=4096, 2025-05-07T20:32:03.1630053Z D=5120, 2025-05-07T20:32:03.1630259Z contiguous=True, 2025-05-07T20:32:03.1630498Z compiled=True, 2025-05-07T20:32:03.1630706Z ) 2025-05-07T20:32:03.1630914Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1631293Z self=, 2025-05-07T20:32:03.1631667Z T=4096, 2025-05-07T20:32:03.1631859Z D=7168, 2025-05-07T20:32:03.1632067Z contiguous=False, 2025-05-07T20:32:03.1632293Z compiled=False, 2025-05-07T20:32:03.1632512Z ) 2025-05-07T20:32:03.1632725Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1633092Z self=, 2025-05-07T20:32:03.1633470Z T=4096, 2025-05-07T20:32:03.1633668Z D=5120, 2025-05-07T20:32:03.1633871Z contiguous=False, 2025-05-07T20:32:03.1634097Z compiled=True, 2025-05-07T20:32:03.1634308Z ) 2025-05-07T20:32:03.1634512Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1634886Z self=, 2025-05-07T20:32:03.1635268Z T=1, 2025-05-07T20:32:03.1635461Z D=7168, 2025-05-07T20:32:03.1635660Z contiguous=True, 2025-05-07T20:32:03.1635895Z compiled=True, 2025-05-07T20:32:03.1636109Z ) 2025-05-07T20:32:03.1636307Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1636688Z self=, 2025-05-07T20:32:03.1637064Z T=1, 2025-05-07T20:32:03.1637253Z D=7168, 2025-05-07T20:32:03.1637465Z contiguous=False, 2025-05-07T20:32:03.1637704Z compiled=True, 2025-05-07T20:32:03.1637912Z ) 2025-05-07T20:32:03.1638124Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1638507Z self=, 2025-05-07T20:32:03.1638876Z T=4096, 2025-05-07T20:32:03.1639078Z D=5120, 2025-05-07T20:32:03.1639282Z contiguous=False, 2025-05-07T20:32:03.1639510Z compiled=False, 2025-05-07T20:32:03.1639733Z ) 2025-05-07T20:32:03.1639940Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1640319Z self=, 2025-05-07T20:32:03.1640691Z T=1, 2025-05-07T20:32:03.1640878Z D=7168, 2025-05-07T20:32:03.1641078Z contiguous=True, 2025-05-07T20:32:03.1641303Z compiled=False, 2025-05-07T20:32:03.1641518Z ) 2025-05-07T20:32:03.1641721Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1642092Z self=, 2025-05-07T20:32:03.1642475Z T=2048, 2025-05-07T20:32:03.1642678Z D=5120, 2025-05-07T20:32:03.1642878Z contiguous=True, 2025-05-07T20:32:03.1643115Z compiled=True, 2025-05-07T20:32:03.1643331Z ) 2025-05-07T20:32:03.1643526Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1643898Z self=, 2025-05-07T20:32:03.1644285Z T=2048, 2025-05-07T20:32:03.1644476Z D=7168, 2025-05-07T20:32:03.1644682Z contiguous=True, 2025-05-07T20:32:03.1644908Z compiled=True, 2025-05-07T20:32:03.1645115Z ) 2025-05-07T20:32:03.1645318Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1645694Z self=, 2025-05-07T20:32:03.1646077Z T=2048, 2025-05-07T20:32:03.1646266Z D=7168, 2025-05-07T20:32:03.1646472Z contiguous=True, 2025-05-07T20:32:03.1646714Z compiled=False, 2025-05-07T20:32:03.1646922Z ) 2025-05-07T20:32:03.1647127Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1647610Z self=, 2025-05-07T20:32:03.1647985Z T=128, 2025-05-07T20:32:03.1648180Z D=5120, 2025-05-07T20:32:03.1648393Z contiguous=False, 2025-05-07T20:32:03.1648620Z 
compiled=True, 2025-05-07T20:32:03.1648831Z ) 2025-05-07T20:32:03.1649040Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1649408Z self=, 2025-05-07T20:32:03.1649869Z T=128, 2025-05-07T20:32:03.1650078Z D=5120, 2025-05-07T20:32:03.1650274Z contiguous=True, 2025-05-07T20:32:03.1650509Z compiled=True, 2025-05-07T20:32:03.1650725Z ) 2025-05-07T20:32:03.1650924Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1651302Z self=, 2025-05-07T20:32:03.1651687Z T=16384, 2025-05-07T20:32:03.1651905Z D=5120, 2025-05-07T20:32:03.1652105Z contiguous=False, 2025-05-07T20:32:03.1652344Z compiled=True, 2025-05-07T20:32:03.1652563Z ) 2025-05-07T20:32:03.1652759Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1653268Z self=, 2025-05-07T20:32:03.1653656Z T=16384, 2025-05-07T20:32:03.1653855Z D=5120, 2025-05-07T20:32:03.1654057Z contiguous=False, 2025-05-07T20:32:03.1654286Z compiled=False, 2025-05-07T20:32:03.1654491Z ) 2025-05-07T20:32:03.1654700Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1655085Z self=, 2025-05-07T20:32:03.1655469Z T=128, 2025-05-07T20:32:03.1655667Z D=7168, 2025-05-07T20:32:03.1655864Z contiguous=True, 2025-05-07T20:32:03.1656094Z compiled=False, 2025-05-07T20:32:03.1656296Z ) 2025-05-07T20:32:03.1656493Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1656866Z self=, 2025-05-07T20:32:03.1657236Z T=128, 2025-05-07T20:32:03.1657438Z D=7168, 2025-05-07T20:32:03.1657651Z contiguous=False, 2025-05-07T20:32:03.1657870Z compiled=False, 2025-05-07T20:32:03.1658081Z ) 2025-05-07T20:32:03.1658282Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1658653Z self=, 2025-05-07T20:32:03.1659032Z T=1, 2025-05-07T20:32:03.1659540Z D=5120, 2025-05-07T20:32:03.1659746Z contiguous=False, 2025-05-07T20:32:03.1659978Z compiled=False, 2025-05-07T20:32:03.1660218Z ) 2025-05-07T20:32:03.1660417Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1660794Z self=, 2025-05-07T20:32:03.1661170Z T=1, 2025-05-07T20:32:03.1661375Z D=7168, 2025-05-07T20:32:03.1661646Z contiguous=False, 2025-05-07T20:32:03.1661930Z compiled=False, 2025-05-07T20:32:03.1662139Z ) 2025-05-07T20:32:03.1662352Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1662738Z self=, 2025-05-07T20:32:03.1663124Z T=4096, 2025-05-07T20:32:03.1663354Z D=5120, 2025-05-07T20:32:03.1663568Z contiguous=True, 2025-05-07T20:32:03.1663797Z compiled=False, 2025-05-07T20:32:03.1664007Z ) 2025-05-07T20:32:03.1664215Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1664584Z self=, 2025-05-07T20:32:03.1664963Z T=128, 2025-05-07T20:32:03.1665165Z D=7168, 2025-05-07T20:32:03.1665369Z contiguous=True, 2025-05-07T20:32:03.1665587Z compiled=True, 2025-05-07T20:32:03.1665798Z ) 2025-05-07T20:32:03.1666001Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1666368Z self=, 2025-05-07T20:32:03.1666746Z T=1, 2025-05-07T20:32:03.1666935Z D=5120, 2025-05-07T20:32:03.1667130Z contiguous=False, 2025-05-07T20:32:03.1667358Z compiled=True, 2025-05-07T20:32:03.1667567Z ) 2025-05-07T20:32:03.1667763Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1668312Z self=, 2025-05-07T20:32:03.1668692Z T=4096, 2025-05-07T20:32:03.1668880Z D=7168, 2025-05-07T20:32:03.1669078Z contiguous=True, 2025-05-07T20:32:03.1669304Z compiled=False, 2025-05-07T20:32:03.1669505Z ) 2025-05-07T20:32:03.1669710Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1670086Z self=, 2025-05-07T20:32:03.1670572Z T=4096, 2025-05-07T20:32:03.1670766Z D=7168, 2025-05-07T20:32:03.1670964Z contiguous=False, 2025-05-07T20:32:03.1671193Z compiled=True, 2025-05-07T20:32:03.1671398Z ) 
2025-05-07T20:32:03.1671603Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1671976Z self=, 2025-05-07T20:32:03.1672345Z T=128, 2025-05-07T20:32:03.1672540Z D=5120, 2025-05-07T20:32:03.1672741Z contiguous=True, 2025-05-07T20:32:03.1672961Z compiled=False, 2025-05-07T20:32:03.1673180Z ) 2025-05-07T20:32:03.1673385Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1673755Z self=, 2025-05-07T20:32:03.1674134Z T=128, 2025-05-07T20:32:03.1674327Z D=5120, 2025-05-07T20:32:03.1674523Z contiguous=False, 2025-05-07T20:32:03.1674756Z compiled=False, 2025-05-07T20:32:03.1674970Z ) 2025-05-07T20:32:03.1675168Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1675553Z self=, 2025-05-07T20:32:03.1675937Z T=1, 2025-05-07T20:32:03.1676119Z D=5120, 2025-05-07T20:32:03.1676324Z contiguous=True, 2025-05-07T20:32:03.1676556Z compiled=False, 2025-05-07T20:32:03.1676759Z ) 2025-05-07T20:32:03.1676960Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1677334Z self=, 2025-05-07T20:32:03.1677710Z T=2048, 2025-05-07T20:32:03.1677893Z D=7168, 2025-05-07T20:32:03.1678094Z contiguous=False, 2025-05-07T20:32:03.1678319Z compiled=True, 2025-05-07T20:32:03.1678523Z ) 2025-05-07T20:32:03.1678729Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1679098Z self=, 2025-05-07T20:32:03.1679467Z T=2048, 2025-05-07T20:32:03.1679659Z D=7168, 2025-05-07T20:32:03.1679858Z contiguous=False, 2025-05-07T20:32:03.1680082Z compiled=False, 2025-05-07T20:32:03.1680298Z ) 2025-05-07T20:32:03.1680501Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1680867Z self=, 2025-05-07T20:32:03.1681241Z T=16384, 2025-05-07T20:32:03.1681441Z D=7168, 2025-05-07T20:32:03.1681635Z contiguous=False, 2025-05-07T20:32:03.1681864Z compiled=True, 2025-05-07T20:32:03.1682075Z ) 2025-05-07T20:32:03.1682267Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1682636Z self=, 2025-05-07T20:32:03.1683015Z T=16384, 2025-05-07T20:32:03.1683214Z D=7168, 2025-05-07T20:32:03.1683442Z contiguous=True, 2025-05-07T20:32:03.1683688Z compiled=True, 2025-05-07T20:32:03.1683901Z ) 2025-05-07T20:32:03.1684096Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1684469Z self=, 2025-05-07T20:32:03.1684844Z T=4096, 2025-05-07T20:32:03.1685037Z D=7168, 2025-05-07T20:32:03.1685244Z contiguous=True, 2025-05-07T20:32:03.1685471Z compiled=True, 2025-05-07T20:32:03.1685672Z ) 2025-05-07T20:32:03.1685873Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1686247Z self=, 2025-05-07T20:32:03.1686618Z T=2048, 2025-05-07T20:32:03.1686810Z D=5120, 2025-05-07T20:32:03.1687011Z contiguous=False, 2025-05-07T20:32:03.1687239Z compiled=False, 2025-05-07T20:32:03.1687451Z ) 2025-05-07T20:32:03.1687653Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1688120Z self=, 2025-05-07T20:32:03.1688498Z T=2048, 2025-05-07T20:32:03.1688694Z D=5120, 2025-05-07T20:32:03.1688890Z contiguous=True, 2025-05-07T20:32:03.1689120Z compiled=False, 2025-05-07T20:32:03.1689336Z ) 2025-05-07T20:32:03.1689535Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1689908Z self=, 2025-05-07T20:32:03.1690387Z T=128, 2025-05-07T20:32:03.1690582Z D=7168, 2025-05-07T20:32:03.1690774Z contiguous=False, 2025-05-07T20:32:03.1691010Z compiled=True, 2025-05-07T20:32:03.1691219Z ) 2025-05-07T20:32:03.1691416Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1691786Z self=, 2025-05-07T20:32:03.1692166Z T=16384, 2025-05-07T20:32:03.1692362Z D=5120, 2025-05-07T20:32:03.1692561Z contiguous=True, 2025-05-07T20:32:03.1692784Z compiled=True, 2025-05-07T20:32:03.1692994Z ) 2025-05-07T20:32:03.1693294Z Trying example: 
test_silu_mul( 2025-05-07T20:32:03.1693714Z self=, 2025-05-07T20:32:03.1694085Z T=2048, 2025-05-07T20:32:03.1694276Z D=5120, 2025-05-07T20:32:03.1694474Z contiguous=False, 2025-05-07T20:32:03.1694701Z compiled=True, 2025-05-07T20:32:03.1694906Z ) 2025-05-07T20:32:03.1695108Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1695480Z self=, 2025-05-07T20:32:03.1695858Z T=16384, 2025-05-07T20:32:03.1696064Z D=5120, 2025-05-07T20:32:03.1696270Z contiguous=True, 2025-05-07T20:32:03.1696487Z compiled=False, 2025-05-07T20:32:03.1696696Z ) 2025-05-07T20:32:03.1696902Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1697265Z self=, 2025-05-07T20:32:03.1697644Z T=16384, 2025-05-07T20:32:03.1697842Z D=7168, 2025-05-07T20:32:03.1698042Z contiguous=False, 2025-05-07T20:32:03.1698274Z compiled=False, 2025-05-07T20:32:03.1698487Z ) 2025-05-07T20:32:03.1698683Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1699057Z self=, 2025-05-07T20:32:03.1699433Z T=16384, 2025-05-07T20:32:03.1699625Z D=7168, 2025-05-07T20:32:03.1699826Z contiguous=True, 2025-05-07T20:32:03.1700063Z compiled=False, 2025-05-07T20:32:03.1700264Z ) 2025-05-07T20:32:03.1700469Z PASSED 2025-05-07T20:32:03.2308110Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2309356Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:03.2310718Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2312252Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2313217Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:03.2314522Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2316221Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2317197Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:03.2318409Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2319912Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2320966Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]           ^^^^^^^^^^^^^^^^^^^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
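The block above is a warning, not the test failure itself: torch.compile's identify_mutated_tensors tries to lower the user Triton kernel to TTIR so it can prove which arguments the kernel writes to, and when that lowering raises (here, the fp8 dtype error) it falls back to treating every tensor argument as mutated. A self-contained sketch of that conservative branch (mutated_args_fallback is an illustrative name):

    from typing import Any, Dict, List

    import torch

    def mutated_args_fallback(kernel_kwargs: Dict[str, Any]) -> List[str]:
        # When TTIR analysis fails, assume every tensor argument may be
        # written by the kernel: always safe, but it blocks optimizations.
        return [
            name
            for name, value in kernel_kwargs.items()
            if isinstance(value, torch.Tensor)
        ]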
moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a7053bec0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
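Note where the error surfaces: _kernel_quantize_fp8_row is autotuned, and Triton's autotuner compiles and benchmarks every candidate config inside do_bench on the first launch, so a dtype that cannot compile aborts the benchmarking loop itself. A toy autotuned kernel showing the same launch shape (names and config values are illustrative, not FBGEMM's):

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 256}), triton.Config({"BLOCK": 512})],
        key=["N"],
    )
    @triton.jit
    def _scale_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    # The first launch triggers compile-and-benchmark for every config,
    # which is exactly the stage failing in the traceback above.
    x = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    _scale_kernel[lambda meta: (triton.cdiv(4096, meta["BLOCK"]),)](x, out, 4096)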

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a704dce00>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
2025-05-07T20:32:04.6956302Z x1 = x[:, D:] 2025-05-07T20:32:04.6956515Z 2025-05-07T20:32:04.6956698Z if contiguous: 2025-05-07T20:32:04.6956934Z x0 = x0.contiguous() 2025-05-07T20:32:04.6957200Z x1 = x1.contiguous() 2025-05-07T20:32:04.6957449Z 2025-05-07T20:32:04.6957649Z if scale_ub is not None: 2025-05-07T20:32:04.6957933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6958267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6958583Z ) 2025-05-07T20:32:04.6958786Z else: 2025-05-07T20:32:04.6958998Z scale_ub_tensor = None 2025-05-07T20:32:04.6959666Z 2025-05-07T20:32:04.6960002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6960416Z op = silu_mul_quant 2025-05-07T20:32:04.6960743Z if compiled: 2025-05-07T20:32:04.6961066Z op = torch.compile(op) 2025-05-07T20:32:04.6961416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6961698Z 2025-05-07T20:32:04.6961907Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.6962195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.6962481Z 2025-05-07T20:32:04.6962723Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6963056Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.6963348Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.6963664Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.6964090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6964403Z 2025-05-07T20:32:04.6964619Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:04.6964816Z 2025-05-07T20:32:04.6964927Z moe/activation_test.py:126: 2025-05-07T20:32:04.6965219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6965551Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.6965878Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6966661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.6967582Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.6968133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6968809Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6969492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.6970317Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6971037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.6971672Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.6972259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.6972778Z fn() 2025-05-07T20:32:04.6973401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.6974032Z self.fn.run( 2025-05-07T20:32:04.6974492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6975020Z kernel = self.compile( 2025-05-07T20:32:04.6975564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6976213Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6976619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6976853Z 2025-05-07T20:32:04.6977057Z self = 2025-05-07T20:32:04.6978132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6979489Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a701ed440>} 2025-05-07T20:32:04.6980803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6981824Z context = 2025-05-07T20:32:04.6982121Z 2025-05-07T20:32:04.6982297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6982819Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6983282Z module_map=module_map) 2025-05-07T20:32:04.6983666Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6984037Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.6984307Z E ^ 2025-05-07T20:32:04.6984778Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6985226Z 2025-05-07T20:32:04.6985639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6986145Z 2025-05-07T20:32:04.6986258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6986664Z self=, 2025-05-07T20:32:04.6987069Z T=16384, 2025-05-07T20:32:04.6987281Z D=7168, 2025-05-07T20:32:04.6987474Z scale_ub=1200.0, 2025-05-07T20:32:04.6987702Z contiguous=False, 2025-05-07T20:32:04.6987934Z compiled=False, 2025-05-07T20:32:04.6988137Z ) 2025-05-07T20:32:04.8917491Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.8920095Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:04.8922730Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.8924815Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.8925783Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8927087Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.8928449Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.8929442Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8930662Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.8932026Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.8933148Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8934415Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.8935655Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:04.8936861Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.8938060Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:04.8938889Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8939905Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:04.8940917Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:04.8941698Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:04.8943000Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.8944318Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.8945420Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:04.8946511Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:04.8947669Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.8949015Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.8950059Z 
W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.8950954Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.8951680Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:04.8952689Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.9515040Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.9516276Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:04.9517594Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.9519004Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.9519971Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9521258Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.9522614Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.9523583Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9524795Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.9526159Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.9527520Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9528786Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.9530143Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:04.9531343Z W0507 20:32:04.948000 97872 
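Every example above fails at the same point for the same reason: the runner's GPU cannot represent Triton's fp8e4nv type. fp8e4nv is Triton's name for the e4m3 encoding (torch.float8_e4m3fn), and Triton's NVIDIA backend only accepts it on newer compute capabilities; the A10G in a g5.4xlarge is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal capability probe, as a sketch (the 8.9 cutoff is our reading of the error and of Triton's Ada/Hopper support, not something the log itself states):

```python
# Sketch: probe whether this GPU can compile Triton kernels that use fp8e4nv
# (torch.float8_e4m3fn). Assumption: the cutoff is compute capability 8.9;
# the A10G on this runner reports (8, 6), which matches the failure above.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv available:", supports_fp8e4nv())
```

Gating the fp8 test cases on a probe like this (for example via unittest.skipUnless) would turn these hard failures into skips on pre-SM89 runners.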
2025-05-07T20:32:05.9004394Z self = 
2025-05-07T20:32:05.9005129Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:05.9005519Z 
2025-05-07T20:32:05.9005632Z     @given(
2025-05-07T20:32:05.9005959Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.9006304Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.9006610Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.9006946Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.9007276Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.9007896Z     )
2025-05-07T20:32:05.9008260Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.9008703Z     def test_silu_mul_quant(
2025-05-07T20:32:05.9008952Z         self,
2025-05-07T20:32:05.9009145Z         T: int,
2025-05-07T20:32:05.9009350Z         D: int,
2025-05-07T20:32:05.9009573Z         scale_ub: Optional[float],
2025-05-07T20:32:05.9009844Z         contiguous: bool,
2025-05-07T20:32:05.9010243Z         compiled: bool,
2025-05-07T20:32:05.9010475Z     ) -> None:
2025-05-07T20:32:05.9010690Z         torch.manual_seed(2025)
2025-05-07T20:32:05.9010939Z 
2025-05-07T20:32:05.9011217Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.9011559Z 
2025-05-07T20:32:05.9011755Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.9012049Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.9012353Z         x = x_sign * x_clamp
2025-05-07T20:32:05.9012597Z         x0 = x[:, :D]
2025-05-07T20:32:05.9012823Z         x1 = x[:, D:]
2025-05-07T20:32:05.9013128Z 
2025-05-07T20:32:05.9013323Z         if contiguous:
2025-05-07T20:32:05.9013559Z             x0 = x0.contiguous()
2025-05-07T20:32:05.9013818Z             x1 = x1.contiguous()
2025-05-07T20:32:05.9014057Z 
2025-05-07T20:32:05.9014254Z         if scale_ub is not None:
2025-05-07T20:32:05.9014529Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.9014867Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.9015177Z             )
2025-05-07T20:32:05.9015371Z         else:
2025-05-07T20:32:05.9015583Z             scale_ub_tensor = None
2025-05-07T20:32:05.9015838Z 
2025-05-07T20:32:05.9016072Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9016381Z             op = silu_mul_quant
2025-05-07T20:32:05.9016631Z             if compiled:
2025-05-07T20:32:05.9016881Z                 op = torch.compile(op)
2025-05-07T20:32:05.9017179Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9017454Z 
2025-05-07T20:32:05.9017652Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:05.9017817Z 
2025-05-07T20:32:05.9017926Z moe/activation_test.py:117: 
2025-05-07T20:32:05.9018218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9018549Z moe/activation_test.py:115: in fn
2025-05-07T20:32:05.9018837Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9019526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:05.9020214Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:05.9020752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.9021432Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.9022088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.9022622Z     kernel = self.compile(
2025-05-07T20:32:05.9023166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.9023809Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.9024242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9024500Z 
2025-05-07T20:32:05.9024708Z self = 
2025-05-07T20:32:05.9025779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.9027232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a5a149940>}
2025-05-07T20:32:05.9028564Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.9029579Z context = 
2025-05-07T20:32:05.9029864Z 
2025-05-07T20:32:05.9030112Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.9030630Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.9031096Z                            module_map=module_map)
2025-05-07T20:32:05.9031466Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.9031816Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:05.9032076Z E       ^
2025-05-07T20:32:05.9032548Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.9032990Z 
2025-05-07T20:32:05.9033409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.9033911Z 
2025-05-07T20:32:05.9034024Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.9034432Z     self=,
2025-05-07T20:32:05.9034836Z     T=1,
2025-05-07T20:32:05.9035022Z     D=7168,
2025-05-07T20:32:05.9035215Z     scale_ub=None,
2025-05-07T20:32:05.9035435Z     contiguous=True,
2025-05-07T20:32:05.9035664Z     compiled=True,
2025-05-07T20:32:05.9035875Z )
2025-05-07T20:32:05.9036198Z self = 
2025-05-07T20:32:05.9036681Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:05.9036936Z 
2025-05-07T20:32:05.9037013Z     @given(
2025-05-07T20:32:05.9037254Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.9037575Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.9037892Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.9038216Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.9038548Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.9038835Z     )
2025-05-07T20:32:05.9039179Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.9039637Z     def test_silu_mul_quant(
2025-05-07T20:32:05.9039890Z         self,
2025-05-07T20:32:05.9040090Z         T: int,
2025-05-07T20:32:05.9040304Z         D: int,
2025-05-07T20:32:05.9040534Z         scale_ub: Optional[float],
2025-05-07T20:32:05.9040803Z         contiguous: bool,
2025-05-07T20:32:05.9041052Z         compiled: bool,
2025-05-07T20:32:05.9041283Z     ) -> None:
2025-05-07T20:32:05.9041501Z         torch.manual_seed(2025)
2025-05-07T20:32:05.9041772Z 
2025-05-07T20:32:05.9042056Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.9042406Z 
2025-05-07T20:32:05.9042611Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.9042904Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.9043231Z         x = x_sign * x_clamp
2025-05-07T20:32:05.9043481Z         x0 = x[:, :D]
2025-05-07T20:32:05.9043698Z         x1 = x[:, D:]
2025-05-07T20:32:05.9043919Z 
2025-05-07T20:32:05.9044115Z         if contiguous:
2025-05-07T20:32:05.9052342Z             x0 = x0.contiguous()
2025-05-07T20:32:05.9052627Z             x1 = x1.contiguous()
2025-05-07T20:32:05.9052871Z 
2025-05-07T20:32:05.9053159Z         if scale_ub is not None:
2025-05-07T20:32:05.9053443Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.9053782Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.9054105Z             )
2025-05-07T20:32:05.9054331Z         else:
2025-05-07T20:32:05.9054565Z             scale_ub_tensor = None
2025-05-07T20:32:05.9054942Z 
2025-05-07T20:32:05.9055186Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9055506Z             op = silu_mul_quant
2025-05-07T20:32:05.9055755Z             if compiled:
2025-05-07T20:32:05.9056010Z                 op = torch.compile(op)
2025-05-07T20:32:05.9056315Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9056588Z 
2025-05-07T20:32:05.9056792Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.9057165Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.9057457Z 
2025-05-07T20:32:05.9057705Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9058068Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.9058365Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.9058675Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.9059034Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.9059714Z 
2025-05-07T20:32:05.9059977Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.9060180Z 
2025-05-07T20:32:05.9060284Z moe/activation_test.py:126: 
2025-05-07T20:32:05.9060585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9060918Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.9061246Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.9062044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.9062792Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.9063327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.9064000Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.9064688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.9065404Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.9066118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.9066751Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.9067350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.9067866Z     fn()
2025-05-07T20:32:05.9068373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.9068952Z     self.fn.run(
2025-05-07T20:32:05.9069428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.9069950Z     kernel = self.compile(
2025-05-07T20:32:05.9070497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.9071146Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.9071538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9071773Z 
2025-05-07T20:32:05.9071981Z self = 
2025-05-07T20:32:05.9073059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.9074415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a700c0860>}
2025-05-07T20:32:05.9075922Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.9076933Z context = 
2025-05-07T20:32:05.9077227Z 
2025-05-07T20:32:05.9077398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.9077924Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.9078517Z                            module_map=module_map)
2025-05-07T20:32:05.9078887Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.9079257Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.9079530Z E       ^
2025-05-07T20:32:05.9079991Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.9080440Z 
2025-05-07T20:32:05.9080855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.9081367Z 
2025-05-07T20:32:05.9081472Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.9081888Z     self=,
2025-05-07T20:32:05.9082282Z     T=4096,
2025-05-07T20:32:05.9082481Z     D=5120,
2025-05-07T20:32:05.9082677Z     scale_ub=None,
2025-05-07T20:32:05.9082892Z     contiguous=False,
2025-05-07T20:32:05.9083143Z     compiled=False,
2025-05-07T20:32:05.9083357Z )
2025-05-07T20:32:06.2091512Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:06.2093793Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:32:06.2095279Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:06.2096767Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:06.2097728Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2099032Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:06.2100404Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.2101374Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2102586Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:06.2103942Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.2104997Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2106617Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:06.2107858Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:32:06.2109062Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:06.2110408Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:32:06.2111219Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2112237Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:06.2113241Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:32:06.2114027Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]            ^^^^^^^^^^^^^
2025-05-07T20:32:06.2115267Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:06.2116531Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:06.2117633Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:06.2118658Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:32:06.2119819Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:06.2121152Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:06.2122200Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.2123105Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.2123841Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:32:06.2124888Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.4180316Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:06.4181127Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.4182136Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:06.4183150Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:06.4183939Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:06.4185128Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.4186402Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.4187501Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.4188615Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:06.4189783Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.4191110Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.4192231Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.4193128Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.4193862Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:06.4195083Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.7135000Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.7136088Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:06.7137399Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.7138832Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.7139799Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7141083Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.7142447Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.7143412Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7144640Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.7145989Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.7147032Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7148288Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.7149838Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:06.7151052Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.7152237Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:06.7153189Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7154201Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:06.7155209Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:06.7155999Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:06.7157192Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.7158443Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.7159798Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.7160826Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:06.7161998Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.7163327Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.7164376Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.7165285Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.7166011Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:06.7167014Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.0897315Z self = 
2025-05-07T20:32:08.0898120Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:08.0898409Z 
2025-05-07T20:32:08.0898493Z     @given(
2025-05-07T20:32:08.0898746Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.0899070Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.0899384Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.0899725Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.0900170Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.0900576Z     )
2025-05-07T20:32:08.0900982Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.0901437Z     def test_silu_mul_quant(
2025-05-07T20:32:08.0901773Z         self,
2025-05-07T20:32:08.0902049Z         T: int,
2025-05-07T20:32:08.0902334Z         D: int,
2025-05-07T20:32:08.0902640Z         scale_ub: Optional[float],
2025-05-07T20:32:08.0903017Z         contiguous: bool,
2025-05-07T20:32:08.0903272Z         compiled: bool,
2025-05-07T20:32:08.0903509Z     ) -> None:
2025-05-07T20:32:08.0903726Z         torch.manual_seed(2025)
2025-05-07T20:32:08.0904035Z 
2025-05-07T20:32:08.0904421Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.0904888Z 
2025-05-07T20:32:08.0905152Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.0905493Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.0905819Z         x = x_sign * x_clamp
2025-05-07T20:32:08.0906069Z         x0 = x[:, :D]
2025-05-07T20:32:08.0906301Z         x1 = x[:, D:]
2025-05-07T20:32:08.0906522Z 
2025-05-07T20:32:08.0906719Z         if contiguous:
2025-05-07T20:32:08.0906975Z             x0 = x0.contiguous()
2025-05-07T20:32:08.0907249Z             x1 = x1.contiguous()
2025-05-07T20:32:08.0907502Z 
2025-05-07T20:32:08.0907716Z         if scale_ub is not None:
2025-05-07T20:32:08.0908011Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.0908346Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.0908670Z             )
2025-05-07T20:32:08.0908882Z         else:
2025-05-07T20:32:08.0909106Z             scale_ub_tensor = None
2025-05-07T20:32:08.0909381Z 
2025-05-07T20:32:08.0909629Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.0909960Z             op = silu_mul_quant
2025-05-07T20:32:08.0910228Z             if compiled:
2025-05-07T20:32:08.0910496Z                 op = torch.compile(op)
2025-05-07T20:32:08.0910810Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.0911090Z 
2025-05-07T20:32:08.0911295Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.0911466Z 
2025-05-07T20:32:08.0911585Z moe/activation_test.py:117: 
2025-05-07T20:32:08.0911888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.0912235Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.0912877Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.0913579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.0914279Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.0914826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.0915715Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.0916379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.0916925Z     kernel = self.compile(
2025-05-07T20:32:08.0917486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.0918146Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.0918555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.0918794Z 
2025-05-07T20:32:08.0919008Z self = 
2025-05-07T20:32:08.0920095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.0921492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a596c40e0>}
2025-05-07T20:32:08.0922821Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.0923849Z context = 
2025-05-07T20:32:08.0924151Z 
2025-05-07T20:32:08.0924325Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.0924855Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.0925353Z                            module_map=module_map)
2025-05-07T20:32:08.0925762Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.0926129Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.0926410Z E       ^
2025-05-07T20:32:08.0926887Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.0927347Z 
2025-05-07T20:32:08.0927764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.0928273Z 
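The failing test body above fixes the contract being checked: silu_mul_quant(x0, x1, scale_ub) should return an FP8 tensor plus per-row scales such that y_fp8.to(torch.float32) * y_scale[:, None] recovers silu(x0) * x1. A rough eager-mode sketch of that contract, assuming torch.float8_e4m3fn is available; this is not the fbgemm_gpu kernel, just the math it fuses, and treating scale_ub as a cap on the per-row max is an assumption here:

import torch

def silu_mul_quant_sketch(x0, x1, scale_ub=None):
    # Fused op: y = silu(x0) * x1 in fp32, then row-wise FP8 (E4M3) quantization.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed upper-bound semantics
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    y_scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale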
2025-05-07T20:32:08.0928392Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.0928817Z     self=,
2025-05-07T20:32:08.0929232Z     T=4096,
2025-05-07T20:32:08.0929439Z     D=7168,
2025-05-07T20:32:08.0929642Z     scale_ub=None,
2025-05-07T20:32:08.0929878Z     contiguous=False,
2025-05-07T20:32:08.0930120Z     compiled=False,
2025-05-07T20:32:08.0930354Z )
2025-05-07T20:32:08.0961020Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.0961435Z     self=,
2025-05-07T20:32:08.0961832Z     T=128,
2025-05-07T20:32:08.0962030Z     D=7168,
2025-05-07T20:32:08.0962243Z     scale_ub=None,
2025-05-07T20:32:08.0962459Z     contiguous=False,
2025-05-07T20:32:08.0962692Z     compiled=True,
2025-05-07T20:32:08.0962902Z )
2025-05-07T20:32:08.1529618Z self = 
2025-05-07T20:32:08.1530339Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:08.1530697Z 
2025-05-07T20:32:08.1530813Z     @given(
2025-05-07T20:32:08.1531108Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.1531456Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.1531774Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.1532115Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.1532443Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.1532731Z     )
2025-05-07T20:32:08.1542684Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.1543205Z     def test_silu_mul_quant(
2025-05-07T20:32:08.1543457Z         self,
2025-05-07T20:32:08.1543652Z         T: int,
2025-05-07T20:32:08.1543853Z         D: int,
2025-05-07T20:32:08.1544074Z         scale_ub: Optional[float],
2025-05-07T20:32:08.1544338Z         contiguous: bool,
2025-05-07T20:32:08.1544583Z         compiled: bool,
2025-05-07T20:32:08.1544809Z     ) -> None:
2025-05-07T20:32:08.1545022Z         torch.manual_seed(2025)
2025-05-07T20:32:08.1545314Z 
2025-05-07T20:32:08.1545604Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.1545943Z 
2025-05-07T20:32:08.1546139Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.1546432Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.1546732Z         x = x_sign * x_clamp
2025-05-07T20:32:08.1546980Z         x0 = x[:, :D]
2025-05-07T20:32:08.1547202Z         x1 = x[:, D:]
2025-05-07T20:32:08.1547409Z 
2025-05-07T20:32:08.1547598Z         if contiguous:
2025-05-07T20:32:08.1547840Z             x0 = x0.contiguous()
2025-05-07T20:32:08.1548092Z             x1 = x1.contiguous()
2025-05-07T20:32:08.1548333Z 
2025-05-07T20:32:08.1548528Z         if scale_ub is not None:
2025-05-07T20:32:08.1548801Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.1549128Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.1549438Z             )
2025-05-07T20:32:08.1549634Z         else:
2025-05-07T20:32:08.1549844Z             scale_ub_tensor = None
2025-05-07T20:32:08.1550107Z 
2025-05-07T20:32:08.1550642Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.1550958Z             op = silu_mul_quant
2025-05-07T20:32:08.1551214Z             if compiled:
2025-05-07T20:32:08.1551466Z                 op = torch.compile(op)
2025-05-07T20:32:08.1551755Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.1552033Z 
2025-05-07T20:32:08.1552234Z         y_fp8, y_scale = fn()
2025-05-07T20:32:08.1552511Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:08.1552952Z 
2025-05-07T20:32:08.1553188Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.1553521Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:08.1553808Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:08.1554119Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:08.1554470Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.1554775Z 
2025-05-07T20:32:08.1554986Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:08.1555182Z 
2025-05-07T20:32:08.1555314Z moe/activation_test.py:126: 
2025-05-07T20:32:08.1555629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.1555967Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:08.1556296Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.1557077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:08.1557827Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:08.1558374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.1559061Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.1560004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:08.1560728Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:08.1561449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:08.1562080Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:08.1562668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:08.1563196Z     fn()
2025-05-07T20:32:08.1563706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:08.1564285Z     self.fn.run(
2025-05-07T20:32:08.1564747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.1565278Z     kernel = self.compile(
2025-05-07T20:32:08.1565825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.1566469Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.1566875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.1567113Z 
2025-05-07T20:32:08.1567318Z self = 
2025-05-07T20:32:08.1568387Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.1569755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a5a226340>}
2025-05-07T20:32:08.1571200Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.1572212Z context = 
2025-05-07T20:32:08.1572497Z 
2025-05-07T20:32:08.1572674Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.1573280Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.1573862Z                            module_map=module_map)
2025-05-07T20:32:08.1574228Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.1574588Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:08.1574851Z E       ^
2025-05-07T20:32:08.1575310Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.1575752Z 
2025-05-07T20:32:08.1576168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.1576677Z 
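In the example above the reported failure comes from the eager reference path (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, hit during autotuning) rather than from silu_mul_quant itself: both sides of the comparison need E4M3 support, so every drawn example on this runner dies the same way. One conventional way to express that hardware requirement, sketched with pytest (this marker is illustrative and does not appear in moe/activation_test.py):

import pytest
import torch

requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (E4M3) kernels require SM 8.9+ (Ada/Hopper)",
)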
2025-05-07T20:32:08.1576790Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.1577193Z     self=,
2025-05-07T20:32:08.1577589Z     T=128,
2025-05-07T20:32:08.1577784Z     D=7168,
2025-05-07T20:32:08.1577978Z     scale_ub=None,
2025-05-07T20:32:08.1578202Z     contiguous=False,
2025-05-07T20:32:08.1578432Z     compiled=False,
2025-05-07T20:32:08.1578643Z )
2025-05-07T20:32:08.3573549Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3573968Z     self=,
2025-05-07T20:32:08.3574378Z     T=4096,
2025-05-07T20:32:08.3574573Z     D=5120,
2025-05-07T20:32:08.3574767Z     scale_ub=1200.0,
2025-05-07T20:32:08.3575005Z     contiguous=True,
2025-05-07T20:32:08.3575259Z     compiled=False,
2025-05-07T20:32:08.3575490Z )
2025-05-07T20:32:08.3604666Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3605084Z     self=,
2025-05-07T20:32:08.3605513Z     T=1,
2025-05-07T20:32:08.3605725Z     D=5120,
2025-05-07T20:32:08.3605917Z     scale_ub=None,
2025-05-07T20:32:08.3606148Z     contiguous=True,
2025-05-07T20:32:08.3606373Z     compiled=True,
2025-05-07T20:32:08.3606573Z )
2025-05-07T20:32:08.6077083Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:08.6078284Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:32:08.6079646Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:08.6081066Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:08.6082055Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6083351Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:08.6084724Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.6085701Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6087205Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:08.6088567Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.6089629Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6091029Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:08.6092260Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:32:08.6093561Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:08.6094754Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:32:08.6095563Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6096573Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:08.6097585Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
2025-05-07T20:32:08.6098368Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]            ^^^^^^^^^^^^^
2025-05-07T20:32:08.6099556Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:08.6100824Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:08.6101928Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:08.6102948Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
2025-05-07T20:32:08.6104118Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:08.6105510Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:08.6106552Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.6107455Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.6108184Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:32:08.6109188Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1219444Z self = 2025-05-07T20:32:09.1220071Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.1220339Z 2025-05-07T20:32:09.1220426Z @given( 2025-05-07T20:32:09.1220684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1221014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1221358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1221699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1222046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1222346Z ) 2025-05-07T20:32:09.1222702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1223157Z def test_silu_mul_quant( 2025-05-07T20:32:09.1223424Z self, 2025-05-07T20:32:09.1223626Z T: int, 2025-05-07T20:32:09.1223843Z D: int, 2025-05-07T20:32:09.1224080Z scale_ub: Optional[float], 2025-05-07T20:32:09.1224361Z contiguous: bool, 2025-05-07T20:32:09.1224628Z compiled: bool, 2025-05-07T20:32:09.1224875Z ) -> None: 2025-05-07T20:32:09.1225108Z torch.manual_seed(2025) 2025-05-07T20:32:09.1225367Z 2025-05-07T20:32:09.1225656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1226012Z 2025-05-07T20:32:09.1226239Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1226543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1226873Z x = x_sign * x_clamp 2025-05-07T20:32:09.1227127Z x0 = x[:, :D] 2025-05-07T20:32:09.1227348Z x1 = x[:, D:] 2025-05-07T20:32:09.1227573Z 2025-05-07T20:32:09.1227774Z if contiguous: 2025-05-07T20:32:09.1228013Z x0 = x0.contiguous() 2025-05-07T20:32:09.1228293Z x1 = x1.contiguous() 2025-05-07T20:32:09.1228554Z 2025-05-07T20:32:09.1228757Z if scale_ub is not None: 2025-05-07T20:32:09.1229038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1229391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1229709Z ) 2025-05-07T20:32:09.1229921Z else: 2025-05-07T20:32:09.1230148Z scale_ub_tensor = None 2025-05-07T20:32:09.1230417Z 2025-05-07T20:32:09.1230665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1231311Z op = silu_mul_quant 2025-05-07T20:32:09.1231574Z if compiled: 2025-05-07T20:32:09.1231842Z op = torch.compile(op) 2025-05-07T20:32:09.1232151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1232438Z 2025-05-07T20:32:09.1232638Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.1232935Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.1233392Z 2025-05-07T20:32:09.1233642Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1233995Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.1234305Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.1234626Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.1235000Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.1235321Z 2025-05-07T20:32:09.1235534Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.1235739Z 2025-05-07T20:32:09.1235854Z moe/activation_test.py:126: 2025-05-07T20:32:09.1236166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1236518Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.1236851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.1237653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:09.1238421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.1238969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1239665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1240360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.1241094Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.1241818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.1242465Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.1243077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.1243608Z fn() 2025-05-07T20:32:09.1244124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.1244714Z self.fn.run( 2025-05-07T20:32:09.1245197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1245727Z kernel = self.compile( 2025-05-07T20:32:09.1246278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1246940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1247346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1247576Z 2025-05-07T20:32:09.1247786Z self = 2025-05-07T20:32:09.1248865Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1250250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58feb420>} 2025-05-07T20:32:09.1251689Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1252707Z context = 2025-05-07T20:32:09.1253119Z 2025-05-07T20:32:09.1253294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1253824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1254303Z module_map=module_map) 2025-05-07T20:32:09.1254757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1255135Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.1255423Z E ^ 2025-05-07T20:32:09.1255936Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1256394Z 2025-05-07T20:32:09.1256810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1257326Z 2025-05-07T20:32:09.1257440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1257863Z self=, 2025-05-07T20:32:09.1258266Z T=2048, 2025-05-07T20:32:09.1258474Z D=5120, 2025-05-07T20:32:09.1258682Z scale_ub=None, 2025-05-07T20:32:09.1258902Z contiguous=True, 2025-05-07T20:32:09.1259139Z compiled=True, 2025-05-07T20:32:09.1259657Z ) 2025-05-07T20:32:09.3665975Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.3667089Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:09.3668428Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.3669840Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.3670808Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3672107Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.3673461Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3674449Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3675672Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.3677030Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3678091Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3679694Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.3680921Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:09.3682150Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.3683491Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:09.3684316Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3685323Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.3686336Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:09.3687123Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.3688322Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.3689592Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.3690683Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.3691715Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:09.3692880Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.3694322Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.3695374Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3696321Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3697059Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:09.3698059Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8666109Z self = 2025-05-07T20:32:09.8666813Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.8667084Z 2025-05-07T20:32:09.8667166Z @given( 2025-05-07T20:32:09.8667416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8667737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8668053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8668379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8668716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8669021Z ) 2025-05-07T20:32:09.8669371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8669825Z def test_silu_mul_quant( 2025-05-07T20:32:09.8670073Z self, 2025-05-07T20:32:09.8670267Z T: int, 2025-05-07T20:32:09.8670470Z D: int, 2025-05-07T20:32:09.8670700Z scale_ub: Optional[float], 2025-05-07T20:32:09.8670969Z contiguous: bool, 2025-05-07T20:32:09.8671215Z compiled: bool, 2025-05-07T20:32:09.8671453Z ) -> None: 2025-05-07T20:32:09.8671668Z torch.manual_seed(2025) 2025-05-07T20:32:09.8671915Z 2025-05-07T20:32:09.8672188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8672527Z 2025-05-07T20:32:09.8672728Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8673021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8673331Z x = x_sign * x_clamp 2025-05-07T20:32:09.8673582Z x0 = x[:, :D] 2025-05-07T20:32:09.8673808Z x1 = x[:, D:] 2025-05-07T20:32:09.8674027Z 2025-05-07T20:32:09.8674218Z if contiguous: 2025-05-07T20:32:09.8674458Z x0 = x0.contiguous() 2025-05-07T20:32:09.8674717Z x1 = x1.contiguous() 2025-05-07T20:32:09.8674952Z 2025-05-07T20:32:09.8675147Z if scale_ub is not None: 2025-05-07T20:32:09.8675428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8675772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8676133Z ) 2025-05-07T20:32:09.8676337Z else: 2025-05-07T20:32:09.8676548Z scale_ub_tensor = None 2025-05-07T20:32:09.8676810Z 2025-05-07T20:32:09.8677052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8677359Z op = silu_mul_quant 2025-05-07T20:32:09.8677612Z if compiled: 2025-05-07T20:32:09.8677871Z op = torch.compile(op) 2025-05-07T20:32:09.8678182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8678458Z 2025-05-07T20:32:09.8678662Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.8678967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.8679261Z 2025-05-07T20:32:09.8679509Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8679852Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.8680145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.8680824Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.8681188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.8681493Z 2025-05-07T20:32:09.8681710Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.8681911Z 2025-05-07T20:32:09.8682017Z moe/activation_test.py:126: 2025-05-07T20:32:09.8682321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8682797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.8683127Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.8683914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:09.8684656Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.8685200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8685937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8686620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.8687328Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.8688050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.8688690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.8689290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.8689796Z fn() 2025-05-07T20:32:09.8690301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.8690878Z self.fn.run( 2025-05-07T20:32:09.8691344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8691873Z kernel = self.compile( 2025-05-07T20:32:09.8692412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8693161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8693557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8693794Z 2025-05-07T20:32:09.8694002Z self = 2025-05-07T20:32:09.8695071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8696448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561800>} 2025-05-07T20:32:09.8697770Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8698789Z context = 2025-05-07T20:32:09.8699078Z 2025-05-07T20:32:09.8699251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8699773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8700236Z module_map=module_map) 2025-05-07T20:32:09.8700610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8700975Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.8701244Z E ^ 2025-05-07T20:32:09.8701809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8702265Z 2025-05-07T20:32:09.8702678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8703181Z 2025-05-07T20:32:09.8703293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8703703Z self=, 2025-05-07T20:32:09.8704196Z T=128, 2025-05-07T20:32:09.8704395Z D=5120, 2025-05-07T20:32:09.8704588Z scale_ub=None, 2025-05-07T20:32:09.8704808Z contiguous=True, 2025-05-07T20:32:09.8705040Z compiled=True, 2025-05-07T20:32:09.8705253Z ) 2025-05-07T20:32:10.1189819Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.1191753Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:10.1194080Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.1196471Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.1198221Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1200544Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.1203012Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1204762Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1206934Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.1209363Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1211094Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1213464Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.1215634Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:10.1217743Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.1219790Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:10.1221569Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1223316Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.1225036Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:10.1226653Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.1228774Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.1230949Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.1232932Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.1234736Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:10.1236806Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.1239203Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.1241008Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1242593Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1243852Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:10.1245652Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6681183Z self = 2025-05-07T20:32:10.6682051Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:10.6682497Z 2025-05-07T20:32:10.6682620Z @given( 2025-05-07T20:32:10.6682985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6683911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6684408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6684943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6685461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6685936Z ) 2025-05-07T20:32:10.6686501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6687205Z def test_silu_mul_quant( 2025-05-07T20:32:10.6687558Z self, 2025-05-07T20:32:10.6687869Z T: int, 2025-05-07T20:32:10.6688172Z D: int, 2025-05-07T20:32:10.6688515Z scale_ub: Optional[float], 2025-05-07T20:32:10.6688948Z contiguous: bool, 2025-05-07T20:32:10.6689327Z compiled: bool, 2025-05-07T20:32:10.6689687Z ) -> None: 2025-05-07T20:32:10.6690022Z torch.manual_seed(2025) 2025-05-07T20:32:10.6690410Z 2025-05-07T20:32:10.6690828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6691378Z 2025-05-07T20:32:10.6691700Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6692174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6692684Z x = x_sign * x_clamp 2025-05-07T20:32:10.6693203Z x0 = x[:, :D] 2025-05-07T20:32:10.6693533Z x1 = x[:, D:] 2025-05-07T20:32:10.6693854Z 2025-05-07T20:32:10.6694152Z if contiguous: 2025-05-07T20:32:10.6694519Z x0 = x0.contiguous() 2025-05-07T20:32:10.6694948Z x1 = x1.contiguous() 2025-05-07T20:32:10.6695348Z 2025-05-07T20:32:10.6695640Z if scale_ub is not None: 2025-05-07T20:32:10.6696080Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6696626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6697120Z ) 2025-05-07T20:32:10.6697409Z else: 2025-05-07T20:32:10.6697732Z scale_ub_tensor = None 2025-05-07T20:32:10.6698134Z 2025-05-07T20:32:10.6698492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6698994Z op = silu_mul_quant 2025-05-07T20:32:10.6699390Z if compiled: 2025-05-07T20:32:10.6699770Z op = torch.compile(op) 2025-05-07T20:32:10.6700240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6700679Z 2025-05-07T20:32:10.6700974Z y_fp8, y_scale = fn() 2025-05-07T20:32:10.6701426Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:10.6701890Z 2025-05-07T20:32:10.6702266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6702790Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:10.6703264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:10.6703766Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:10.6704359Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.6704885Z 2025-05-07T20:32:10.6705212Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:10.6705506Z 2025-05-07T20:32:10.6705658Z moe/activation_test.py:126: 2025-05-07T20:32:10.6706160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6706636Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:10.6707095Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.6708429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:10.6709485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:10.6710236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6711158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6712117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:10.6713218Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:10.6714224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:10.6715104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:10.6715929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:10.6716642Z fn() 2025-05-07T20:32:10.6717353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:10.6718151Z self.fn.run( 2025-05-07T20:32:10.6718786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6719520Z kernel = self.compile( 2025-05-07T20:32:10.6720258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6721163Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6721700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6722010Z 2025-05-07T20:32:10.6722286Z self = 2025-05-07T20:32:10.6723780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6725736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58bad620>} 2025-05-07T20:32:10.6727866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6729538Z context = 2025-05-07T20:32:10.6729968Z 2025-05-07T20:32:10.6730223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6730952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6731608Z module_map=module_map) 2025-05-07T20:32:10.6732121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6732612Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:10.6732974Z E ^ 2025-05-07T20:32:10.6733773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:10.6735073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
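[Editor's note] The failure above is environmental, not a bug in the test logic: Triton's fp8e4nv type maps to float8 e4m3fn, which the NVIDIA backend only accepts on GPUs of compute capability 8.9+ (Ada/Hopper), while this job runs on an A10G (linux.g5.4xlarge, sm_86) that exposes only fp8e4b15 and fp8e5, exactly as the ValueError says. The repeated identify_mutated_tensors warnings are a side effect: when TTIR generation raises, torch.compile's mutation analysis gives up and conservatively assumes every kernel input is mutated. A minimal capability guard along the following lines would skip these cases on unsupported hardware; this is a sketch, and the helper name is illustrative, not FBGEMM API:

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (float8 e4m3fn) needs compute capability >= 8.9;
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTest(unittest.TestCase):
        ...

With such a guard the job would report these cases as skipped instead of walking every Hypothesis example into the same guaranteed CompilationError.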
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.9291799Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:10.9292621Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9293749Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.9294763Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:10.9295546Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.9296908Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.9298180Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.9299294Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.9300399Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:10.9301559Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.9302915Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.9303972Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.9304880Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.9305618Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:10.9306636Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9985482Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.9986565Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:10.9987897Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.9989324Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.9990293Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9991591Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.9992960Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.9993936Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9995159Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.9996519Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.9997926Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9999202Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.0000719Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.0001926Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.0003125Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.0003948Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.0004968Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.0005977Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.0006780Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.0007982Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.0009258Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.0010374Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.0011405Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.0012568Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.0014007Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.0015060Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.0015961Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.0016701Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.0017708Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2071648Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.2073252Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:11.2074586Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.2076019Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.2077141Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2078432Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.2079815Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2080793Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2082018Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.2083396Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2084448Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2085719Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.2086956Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.2088178Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.2089380Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.2090205Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2091404Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.2092425Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.2093301Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.2094499Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.2095759Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.2096951Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.2097987Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.2099154Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.2100582Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.2101628Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2102535Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2103274Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.2104276Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2168601Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.2169781Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:11.2171109Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.2172510Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.2173550Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2174847Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.2176219Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2177193Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2178407Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.2179771Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2180819Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2182260Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.2183492Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.2184691Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.2185993Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.2186867Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2187879Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.2188880Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.2189658Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.2190849Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.2192099Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.2193207Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.2194241Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.2195400Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.2196789Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.2197826Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2198721Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2199453Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.2200464Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.4725881Z self =
2025-05-07T20:32:11.4726518Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row via moe/activation_test.py:126 -> ref_fn (:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370), with the same triton jit/autotune/compile frames ...]
2025-05-07T20:32:11.4768928Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.4769296Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.4769573Z E       ^
2025-05-07T20:32:11.4770044Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.4770912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
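[Editor's note] For readers following the reference path: triton_quantize_fp8_row performs dynamic row-wise quantization, producing one scale per row, which is why the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that idea, under the assumption that scale_ub clamps the per-row max (an illustration of the scheme, not FBGEMM's exact kernel semantics):

    from typing import Optional

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: scale = row_max_abs / FP8_MAX, optionally clamped.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / fp8_max).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that the final cast is the step Triton cannot code-generate here: PyTorch can hold and convert float8_e4m3fn tensors on sm_86 in eager mode, but the Triton NVIDIA backend refuses the fp8e4nv conversion for this architecture.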
2025-05-07T20:32:11.4771535Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:11.5050897Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.5053442Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.5056094Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.5057232Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.5058655Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
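[Editor's note] This recompile-limit warning is a consequence of the parameter sweep rather than a defect: x0 = x[:, :D] has stride(0) == 2*D (here 10240) as a view of x but stride(0) == D (5120) after .contiguous(), so each new (T, D, contiguity) combination fails a dynamo guard until the eighth recompile, after which dynamo falls back to eager for silu_mul_quant. If the sweep were meant to stay compiled, either of the following would help (a sketch; the import path is inferred from the traceback, and config defaults vary across torch releases):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Compile once with dynamic shapes so new sizes/strides add fewer guards:
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)

    # ...or simply raise the limit named in the warning for an exhaustive sweep:
    torch._dynamo.config.recompile_limit = 64

Setting TORCH_LOGS="recompiles", as the warning itself suggests, prints every guard failure and is the quickest way to confirm which inputs are being specialized on.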
2025-05-07T20:32:11.5935262Z self =
2025-05-07T20:32:11.5935796Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row ...]
2025-05-07T20:32:11.5971121Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.5971495Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.5971776Z E       ^
2025-05-07T20:32:11.5972242Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.5973192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:11.5973814Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:11.7366637Z self =
2025-05-07T20:32:11.7367385Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[... test source identical to the first report elided; unlike the earlier examples, this one fails already at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), i.e. while compiling the forward kernel itself ...]
2025-05-07T20:32:11.7380930Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.7381215Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.7381769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.7382330Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.7382991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.7383688Z     _fbgemm_silu_mul_quant[grid](
[... triton jit/compile frames elided ...]
2025-05-07T20:32:11.7394985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.7395348Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.7395619Z E       ^
2025-05-07T20:32:11.7396091Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.7396961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:11.7397575Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:32:11.8011606Z self =
2025-05-07T20:32:11.8012320Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row via ref_fn -> triton_quantize_fp8_row ...]
2025-05-07T20:32:11.8046331Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.8046744Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.8047014Z E       ^
2025-05-07T20:32:11.8047480Z E       ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:11.8047926Z 
2025-05-07T20:32:11.8048344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.8048848Z 
2025-05-07T20:32:11.8048967Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.8049383Z     self=,
2025-05-07T20:32:11.8049787Z     T=1,
2025-05-07T20:32:11.8049976Z     D=5120,
2025-05-07T20:32:11.8050175Z     scale_ub=None,
2025-05-07T20:32:11.8050404Z     contiguous=True,
2025-05-07T20:32:11.8050638Z     compiled=False,
2025-05-07T20:32:11.8050852Z )
2025-05-07T20:32:11.9549368Z self = 
2025-05-07T20:32:11.9550056Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[test source identical to the example above, this time failing earlier, at the fn() call]
2025-05-07T20:32:11.9571360Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.9571523Z 
2025-05-07T20:32:11.9571631Z moe/activation_test.py:117: 
2025-05-07T20:32:11.9571922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9572261Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.9572552Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.9573326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.9574011Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:11.9574548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:11.9575223Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.9575883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.9576413Z     kernel = self.compile(
2025-05-07T20:32:11.9576960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.9577599Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.9578007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9578246Z 
2025-05-07T20:32:11.9578451Z self = 
2025-05-07T20:32:11.9579521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:11.9580969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7e700>}
2025-05-07T20:32:11.9582291Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:11.9583294Z context = 
2025-05-07T20:32:11.9583589Z 
2025-05-07T20:32:11.9583755Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:11.9584295Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:11.9584769Z                            module_map=module_map)
2025-05-07T20:32:11.9585143Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9585501Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9585764Z E       ^
2025-05-07T20:32:11.9586434Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
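For reference, ref_fn in the listing above computes SiLU(x0) * x1 in fp32 and then quantizes row-wise via triton_quantize_fp8_row. A rough pure-PyTorch sketch of that rowwise quantization step, useful for reasoning about the test on hardware where the Triton kernel cannot compile; the function name, the use of torch.float8_e4m3fn, and the epsilon are assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max sets the scale so each row fills the fp8 range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            # Cap outlier rows, mirroring the scale_ub_tensor argument above.
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / fp8_max).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)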
2025-05-07T20:32:11.9586878Z 
2025-05-07T20:32:11.9587288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9587797Z 
2025-05-07T20:32:11.9587905Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9588319Z     self=,
2025-05-07T20:32:11.9588843Z     T=128,
2025-05-07T20:32:11.9589028Z     D=5120,
2025-05-07T20:32:11.9589224Z     scale_ub=None,
2025-05-07T20:32:11.9589448Z     contiguous=False,
2025-05-07T20:32:11.9589673Z     compiled=True,
2025-05-07T20:32:11.9589893Z )
[test source identical to the examples above, again failing at the fn() call at moe/activation_test.py:117; with compiled=True the traceback passes through torch._dynamo before reaching the same Triton compile path]
2025-05-07T20:32:11.9603197Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.9603477Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.9604030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.9604669Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.9605312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.9605989Z     _fbgemm_silu_mul_quant[grid](
[remaining Triton jit/compile frames identical to the example above]
2025-05-07T20:32:11.9617116Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9617477Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9617730Z E       ^
2025-05-07T20:32:11.9618193Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
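With compiled=True the only new frame is torch/_dynamo/eval_frame.py: torch.compile wraps silu_mul_quant but still reaches the same _fbgemm_silu_mul_quant launch, so both modes die at the same point. The failure itself needs none of the FBGEMM machinery; a hypothetical, self-contained Triton kernel that converts to tl.float8e4nv should reproduce the identical CompilationError on a pre-sm_89 GPU (a sketch under that assumption, not code from this repository):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this conversion is what triggers
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on A10G-class GPUs.
    _cast_fp8e4nv_kernel[(1,)](x, y, 16, BLOCK=16)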
2025-05-07T20:32:11.9618640Z 
2025-05-07T20:32:11.9619050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9619548Z 
2025-05-07T20:32:11.9619664Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9620074Z     self=,
2025-05-07T20:32:11.9620472Z     T=128,
2025-05-07T20:32:11.9620669Z     D=7168,
2025-05-07T20:32:11.9620874Z     scale_ub=1200.0,
2025-05-07T20:32:11.9621096Z     contiguous=False,
2025-05-07T20:32:11.9621334Z     compiled=False,
2025-05-07T20:32:11.9621545Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.0778038Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.0778398Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.0778668Z E       ^
2025-05-07T20:32:12.0779139Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.0779593Z 
2025-05-07T20:32:12.0780013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.0780520Z 
2025-05-07T20:32:12.0780631Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.0781048Z     self=,
2025-05-07T20:32:12.0781456Z     T=128,
2025-05-07T20:32:12.0781653Z     D=5120,
2025-05-07T20:32:12.0781857Z     scale_ub=None,
2025-05-07T20:32:12.0782080Z     contiguous=False,
2025-05-07T20:32:12.0782310Z     compiled=False,
2025-05-07T20:32:12.0782530Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.0809200Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.0809567Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.0809835Z E       ^
2025-05-07T20:32:12.0810307Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.0810755Z 
2025-05-07T20:32:12.0811175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.0811686Z 
2025-05-07T20:32:12.0811799Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.0812219Z     self=,
2025-05-07T20:32:12.0812630Z     T=128,
2025-05-07T20:32:12.0812828Z     D=5120,
2025-05-07T20:32:12.0813132Z     scale_ub=1200.0,
2025-05-07T20:32:12.0813369Z     contiguous=True,
2025-05-07T20:32:12.0813606Z     compiled=False,
2025-05-07T20:32:12.0813816Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.4573669Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4574019Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4574282Z E       ^
2025-05-07T20:32:12.4574744Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4575184Z 
2025-05-07T20:32:12.4575593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4576100Z 
2025-05-07T20:32:12.4576205Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4576619Z     self=,
2025-05-07T20:32:12.4577014Z     T=1,
2025-05-07T20:32:12.4577195Z     D=7168,
2025-05-07T20:32:12.4577409Z     scale_ub=1200.0,
2025-05-07T20:32:12.4577644Z     contiguous=True,
2025-05-07T20:32:12.4577865Z     compiled=True,
2025-05-07T20:32:12.4578078Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.4614012Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4614450Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4614716Z E       ^
2025-05-07T20:32:12.4615183Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4615627Z 
2025-05-07T20:32:12.4616047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4616557Z 
2025-05-07T20:32:12.4616738Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4617157Z     self=,
2025-05-07T20:32:12.4617563Z     T=1,
2025-05-07T20:32:12.4617762Z     D=7168,
2025-05-07T20:32:12.4617957Z     scale_ub=1200.0,
2025-05-07T20:32:12.4618190Z     contiguous=False,
2025-05-07T20:32:12.4618428Z     compiled=True,
2025-05-07T20:32:12.4618636Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.6053483Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.6053849Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.6054107Z E       ^
2025-05-07T20:32:12.6054572Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.6055017Z 
2025-05-07T20:32:12.6055439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.6055944Z 
2025-05-07T20:32:12.6056057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.6056472Z     self=,
2025-05-07T20:32:12.6056904Z     T=1,
2025-05-07T20:32:12.6057115Z     D=7168,
2025-05-07T20:32:12.6057302Z     scale_ub=None,
2025-05-07T20:32:12.6057524Z     contiguous=False,
2025-05-07T20:32:12.6057759Z     compiled=True,
2025-05-07T20:32:12.6057964Z )
[test source and traceback identical to the first example above: fn() succeeds here and the failure comes from ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row through the autotuner]
2025-05-07T20:32:12.6994232Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.6994589Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:12.6994865Z E       ^
2025-05-07T20:32:12.6995334Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.6995781Z 
2025-05-07T20:32:12.6996189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.6996695Z 
2025-05-07T20:32:12.6996799Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.6997260Z     self=,
2025-05-07T20:32:12.6997654Z     T=1,
2025-05-07T20:32:12.6997849Z     D=5120,
2025-05-07T20:32:12.6998055Z     scale_ub=1200.0,
2025-05-07T20:32:12.6998277Z     contiguous=False,
2025-05-07T20:32:12.6998506Z     compiled=True,
2025-05-07T20:32:12.6998719Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.8552966Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8553315Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8553581Z E       ^
2025-05-07T20:32:12.8554050Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8554493Z 
2025-05-07T20:32:12.8554904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8555416Z 
2025-05-07T20:32:12.8555519Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.8555931Z     self=,
2025-05-07T20:32:12.8556336Z     T=1,
2025-05-07T20:32:12.8556516Z     D=5120,
2025-05-07T20:32:12.8556713Z     scale_ub=1200.0,
2025-05-07T20:32:12.8556946Z     contiguous=False,
2025-05-07T20:32:12.8557207Z     compiled=False,
2025-05-07T20:32:12.8557421Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.8592539Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8592895Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8593151Z E       ^
2025-05-07T20:32:12.8593622Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Hypothesis then kept drawing from the same example space, and every draw failed identically: the same test source, the same traceback through moe/activation.py:80 (silu_mul_quant) into triton/compiler/compiler.py:100, and the same CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The compiled=True draws additionally pass through torch/_dynamo/eval_frame.py:678 before reaching the kernel launch. Only the drawn parameters differ:

Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7793681Z 2025-05-07T20:32:13.7794095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.7794615Z 2025-05-07T20:32:13.7794722Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7795140Z self=, 2025-05-07T20:32:13.7795538Z T=1, 2025-05-07T20:32:13.7795726Z D=7168, 2025-05-07T20:32:13.7795936Z scale_ub=None, 2025-05-07T20:32:13.7796164Z contiguous=False, 2025-05-07T20:32:13.7796393Z compiled=False, 2025-05-07T20:32:13.7796606Z ) 2025-05-07T20:32:13.7796935Z self = 2025-05-07T20:32:13.7797415Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.7797682Z 2025-05-07T20:32:13.7797766Z @given( 2025-05-07T20:32:13.7798001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7798316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7798621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7798961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7799297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7799580Z ) 2025-05-07T20:32:13.7799937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7800380Z def test_silu_mul_quant( 2025-05-07T20:32:13.7800619Z self, 2025-05-07T20:32:13.7800826Z T: int, 2025-05-07T20:32:13.7801037Z D: int, 2025-05-07T20:32:13.7801341Z scale_ub: Optional[float], 2025-05-07T20:32:13.7801620Z contiguous: bool, 2025-05-07T20:32:13.7801871Z compiled: bool, 2025-05-07T20:32:13.7802096Z ) -> None: 2025-05-07T20:32:13.7802322Z torch.manual_seed(2025) 2025-05-07T20:32:13.7802579Z 2025-05-07T20:32:13.7802850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7803195Z 2025-05-07T20:32:13.7803396Z x_sign = torch.sign(x) 2025-05-07T20:32:13.7803765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.7804073Z x = x_sign * x_clamp 2025-05-07T20:32:13.7804321Z x0 = x[:, :D] 2025-05-07T20:32:13.7804546Z x1 = x[:, D:] 2025-05-07T20:32:13.7804754Z 2025-05-07T20:32:13.7804943Z if contiguous: 2025-05-07T20:32:13.7805182Z x0 = x0.contiguous() 2025-05-07T20:32:13.7805438Z x1 = x1.contiguous() 2025-05-07T20:32:13.7805686Z 2025-05-07T20:32:13.7805884Z if scale_ub is not None: 2025-05-07T20:32:13.7806165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.7806504Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.7806817Z ) 2025-05-07T20:32:13.7807014Z else: 2025-05-07T20:32:13.7807233Z scale_ub_tensor = None 2025-05-07T20:32:13.7807489Z 2025-05-07T20:32:13.7807719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.7808043Z op = silu_mul_quant 2025-05-07T20:32:13.7808297Z if compiled: 2025-05-07T20:32:13.7808551Z op = torch.compile(op) 2025-05-07T20:32:13.7808844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.7809125Z 2025-05-07T20:32:13.7809327Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.7809491Z 2025-05-07T20:32:13.7809591Z moe/activation_test.py:117: 2025-05-07T20:32:13.7809892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.7810228Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.7810512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.7811199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.7811881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.7812423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.7813160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.7813821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.7814354Z kernel = self.compile( 2025-05-07T20:32:13.7814892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.7815547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.7815959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.7816191Z 2025-05-07T20:32:13.7816414Z self = 2025-05-07T20:32:13.7817479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.7818846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a703bbc40>} 2025-05-07T20:32:13.7820179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.7821274Z context = 2025-05-07T20:32:13.7821561Z 2025-05-07T20:32:13.7821735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.7822248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.7822716Z module_map=module_map) 2025-05-07T20:32:13.7823082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.7823508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.7823771Z E ^ 2025-05-07T20:32:13.7824235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7824677Z 2025-05-07T20:32:13.7825093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.7825596Z 2025-05-07T20:32:13.7825701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7826124Z self=, 2025-05-07T20:32:13.7826521Z T=2048, 2025-05-07T20:32:13.7826710Z D=7168, 2025-05-07T20:32:13.7826906Z scale_ub=None, 2025-05-07T20:32:13.7827128Z contiguous=False, 2025-05-07T20:32:13.7827349Z compiled=True, 2025-05-07T20:32:13.7827559Z ) 2025-05-07T20:32:13.8694904Z self = 2025-05-07T20:32:13.8695420Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8695710Z 2025-05-07T20:32:13.8695787Z @given( 2025-05-07T20:32:13.8696021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8696341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8696640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8696978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8697339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8697654Z ) 2025-05-07T20:32:13.8698006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8698451Z def test_silu_mul_quant( 2025-05-07T20:32:13.8698702Z self, 2025-05-07T20:32:13.8698895Z T: int, 2025-05-07T20:32:13.8699101Z D: int, 2025-05-07T20:32:13.8699327Z scale_ub: Optional[float], 2025-05-07T20:32:13.8699598Z contiguous: bool, 2025-05-07T20:32:13.8699841Z compiled: bool, 2025-05-07T20:32:13.8700076Z ) -> None: 2025-05-07T20:32:13.8700290Z torch.manual_seed(2025) 2025-05-07T20:32:13.8700538Z 2025-05-07T20:32:13.8700814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8701154Z 2025-05-07T20:32:13.8701392Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8701684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8702007Z x = x_sign * x_clamp 2025-05-07T20:32:13.8702254Z x0 = x[:, :D] 2025-05-07T20:32:13.8702473Z x1 = x[:, D:] 2025-05-07T20:32:13.8702686Z 2025-05-07T20:32:13.8702878Z if contiguous: 2025-05-07T20:32:13.8703106Z x0 = x0.contiguous() 2025-05-07T20:32:13.8703378Z x1 = x1.contiguous() 2025-05-07T20:32:13.8703622Z 2025-05-07T20:32:13.8703821Z if scale_ub is not None: 2025-05-07T20:32:13.8704097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8704430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8704749Z ) 2025-05-07T20:32:13.8704943Z else: 2025-05-07T20:32:13.8705157Z scale_ub_tensor = None 2025-05-07T20:32:13.8705412Z 2025-05-07T20:32:13.8705647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8705963Z op = silu_mul_quant 2025-05-07T20:32:13.8706219Z if compiled: 2025-05-07T20:32:13.8706464Z op = torch.compile(op) 2025-05-07T20:32:13.8706769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8707453Z 2025-05-07T20:32:13.8707649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8707820Z 2025-05-07T20:32:13.8707918Z moe/activation_test.py:117: 2025-05-07T20:32:13.8708226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8708559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8708838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8709537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8710098Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.8710747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8711428Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8711966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8712646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8713302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8713834Z kernel = self.compile( 2025-05-07T20:32:13.8714374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8715024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8715427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8715662Z 2025-05-07T20:32:13.8715869Z self = 2025-05-07T20:32:13.8716959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8718321Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a9c629800>} 2025-05-07T20:32:13.8719638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8720652Z context = 2025-05-07T20:32:13.8720940Z 2025-05-07T20:32:13.8721113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8721632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8722093Z module_map=module_map) 2025-05-07T20:32:13.8722461Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8722828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8731566Z E ^ 2025-05-07T20:32:13.8732077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8732537Z 2025-05-07T20:32:13.8732959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8733568Z 2025-05-07T20:32:13.8733694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8734110Z self=, 2025-05-07T20:32:13.8734514Z T=4096, 2025-05-07T20:32:13.8734710Z D=7168, 2025-05-07T20:32:13.8734897Z scale_ub=None, 2025-05-07T20:32:13.8735117Z contiguous=False, 2025-05-07T20:32:13.8735341Z compiled=True, 2025-05-07T20:32:13.8735555Z ) 2025-05-07T20:32:13.8735872Z self = 2025-05-07T20:32:13.8736484Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8736755Z 2025-05-07T20:32:13.8736846Z @given( 2025-05-07T20:32:13.8737082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8737453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8737777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8738104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8738516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8738805Z ) 2025-05-07T20:32:13.8739166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8739605Z def test_silu_mul_quant( 2025-05-07T20:32:13.8739852Z self, 2025-05-07T20:32:13.8740060Z T: int, 2025-05-07T20:32:13.8740260Z D: int, 2025-05-07T20:32:13.8740490Z scale_ub: Optional[float], 2025-05-07T20:32:13.8740773Z contiguous: bool, 2025-05-07T20:32:13.8741014Z compiled: bool, 2025-05-07T20:32:13.8741257Z ) -> None: 2025-05-07T20:32:13.8741488Z torch.manual_seed(2025) 2025-05-07T20:32:13.8741724Z 2025-05-07T20:32:13.8742001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8742351Z 2025-05-07T20:32:13.8742546Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8742846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8743166Z x = x_sign * x_clamp 2025-05-07T20:32:13.8743418Z x0 = x[:, :D] 2025-05-07T20:32:13.8743651Z x1 = x[:, D:] 2025-05-07T20:32:13.8743868Z 2025-05-07T20:32:13.8744059Z if contiguous: 2025-05-07T20:32:13.8744298Z x0 = x0.contiguous() 2025-05-07T20:32:13.8744571Z x1 = x1.contiguous() 2025-05-07T20:32:13.8744821Z 2025-05-07T20:32:13.8745014Z if scale_ub is not None: 2025-05-07T20:32:13.8745303Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8745655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8745961Z ) 2025-05-07T20:32:13.8746163Z else: 2025-05-07T20:32:13.8746384Z scale_ub_tensor = None 2025-05-07T20:32:13.8746637Z 2025-05-07T20:32:13.8746884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8747237Z op = silu_mul_quant 2025-05-07T20:32:13.8747514Z if compiled: 2025-05-07T20:32:13.8747776Z op = torch.compile(op) 2025-05-07T20:32:13.8748088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8748363Z 2025-05-07T20:32:13.8748566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8748731Z 2025-05-07T20:32:13.8748845Z moe/activation_test.py:117: 2025-05-07T20:32:13.8749149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8749481Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8749770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8750336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8750895Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.8751560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8752261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8752825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8753505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8754171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8754709Z kernel = self.compile( 2025-05-07T20:32:13.8755250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8755997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8756411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8756642Z 2025-05-07T20:32:13.8756865Z self = 2025-05-07T20:32:13.8757986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8759673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43953880>} 2025-05-07T20:32:13.8761024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8762074Z context = 2025-05-07T20:32:13.8762370Z 2025-05-07T20:32:13.8762562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8763098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8763577Z module_map=module_map) 2025-05-07T20:32:13.8763969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8764331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8764605Z E ^ 2025-05-07T20:32:13.8765092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8765549Z 2025-05-07T20:32:13.8765971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8766476Z 2025-05-07T20:32:14.0344886Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.0345665Z self=, 2025-05-07T20:32:14.0346250Z T=16384, 2025-05-07T20:32:14.0346541Z D=5120, 2025-05-07T20:32:14.0346831Z scale_ub=1200.0, 2025-05-07T20:32:14.0347073Z contiguous=False, 2025-05-07T20:32:14.0347308Z compiled=False, 2025-05-07T20:32:14.0347541Z ) 2025-05-07T20:32:14.0347877Z self = 2025-05-07T20:32:14.0348404Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.0348704Z 2025-05-07T20:32:14.0348785Z @given( 2025-05-07T20:32:14.0349035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.0349360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.0349672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.0350011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.0350354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.0350645Z ) 2025-05-07T20:32:14.0351008Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.0351456Z def test_silu_mul_quant( 2025-05-07T20:32:14.0351706Z self, 2025-05-07T20:32:14.0351918Z T: int, 2025-05-07T20:32:14.0352129Z D: int, 2025-05-07T20:32:14.0352351Z scale_ub: Optional[float], 2025-05-07T20:32:14.0352644Z contiguous: bool, 2025-05-07T20:32:14.0352898Z compiled: bool, 2025-05-07T20:32:14.0353144Z ) -> None: 2025-05-07T20:32:14.0353363Z torch.manual_seed(2025) 2025-05-07T20:32:14.0353615Z 2025-05-07T20:32:14.0353902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.0354249Z 2025-05-07T20:32:14.0354457Z x_sign = torch.sign(x) 2025-05-07T20:32:14.0354755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.0355067Z x = x_sign * x_clamp 2025-05-07T20:32:14.0355571Z x0 = x[:, :D] 2025-05-07T20:32:14.0355803Z x1 = x[:, D:] 2025-05-07T20:32:14.0356010Z 2025-05-07T20:32:14.0356207Z if contiguous: 2025-05-07T20:32:14.0356448Z x0 = x0.contiguous() 2025-05-07T20:32:14.0356710Z x1 = x1.contiguous() 2025-05-07T20:32:14.0356964Z 2025-05-07T20:32:14.0357168Z if scale_ub is not None: 2025-05-07T20:32:14.0357447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.0357937Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.0358265Z ) 2025-05-07T20:32:14.0358469Z else: 2025-05-07T20:32:14.0358679Z scale_ub_tensor = None 2025-05-07T20:32:14.0358939Z 2025-05-07T20:32:14.0359411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.0359769Z op = silu_mul_quant 2025-05-07T20:32:14.0360034Z if compiled: 2025-05-07T20:32:14.0360290Z op = torch.compile(op) 2025-05-07T20:32:14.0360596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0360884Z 2025-05-07T20:32:14.0361087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.0361255Z 2025-05-07T20:32:14.0361357Z moe/activation_test.py:117: 2025-05-07T20:32:14.0361665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0362009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.0362301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0362998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.0363684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.0364223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.0364895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.0365565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.0366097Z kernel = self.compile( 2025-05-07T20:32:14.0366640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.0367290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.0367680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0367915Z 2025-05-07T20:32:14.0368126Z self = 2025-05-07T20:32:14.0369197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.0370581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43952700>} 2025-05-07T20:32:14.0371897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.0372898Z context = 2025-05-07T20:32:14.0373262Z 2025-05-07T20:32:14.0373428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.0373943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.0374408Z module_map=module_map) 2025-05-07T20:32:14.0374766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.0375120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.0375384Z E ^ 2025-05-07T20:32:14.0375979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.0376431Z 2025-05-07T20:32:14.0376842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.0377347Z 2025-05-07T20:32:14.0377452Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.0377862Z self=, 2025-05-07T20:32:14.0378369Z T=16384, 2025-05-07T20:32:14.0378568Z D=5120, 2025-05-07T20:32:14.0378765Z scale_ub=1200.0, 2025-05-07T20:32:14.0378987Z contiguous=True, 2025-05-07T20:32:14.0379210Z compiled=True, 2025-05-07T20:32:14.0379418Z ) 2025-05-07T20:32:14.0379731Z self = 2025-05-07T20:32:14.0380228Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.0380506Z 2025-05-07T20:32:14.0380586Z @given( 2025-05-07T20:32:14.0380824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.0381133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.0381439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.0381784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.0382107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.0382401Z ) 2025-05-07T20:32:14.0382755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.0383198Z def test_silu_mul_quant( 2025-05-07T20:32:14.0383493Z self, 2025-05-07T20:32:14.0383785Z T: int, 2025-05-07T20:32:14.0384099Z D: int, 2025-05-07T20:32:14.0384425Z scale_ub: Optional[float], 2025-05-07T20:32:14.0384816Z contiguous: bool, 2025-05-07T20:32:14.0385171Z compiled: bool, 2025-05-07T20:32:14.0385489Z ) -> None: 2025-05-07T20:32:14.0385790Z torch.manual_seed(2025) 2025-05-07T20:32:14.0386168Z 2025-05-07T20:32:14.0386596Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.0387162Z 2025-05-07T20:32:14.0387468Z x_sign = torch.sign(x) 2025-05-07T20:32:14.0387925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.0388433Z x = x_sign * x_clamp 2025-05-07T20:32:14.0388824Z x0 = x[:, :D] 2025-05-07T20:32:14.0389165Z x1 = x[:, D:] 2025-05-07T20:32:14.0389520Z 2025-05-07T20:32:14.0389824Z if contiguous: 2025-05-07T20:32:14.0390200Z x0 = x0.contiguous() 2025-05-07T20:32:14.0390617Z x1 = x1.contiguous() 2025-05-07T20:32:14.0391073Z 2025-05-07T20:32:14.0391365Z if scale_ub is not None: 2025-05-07T20:32:14.0391769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.0392235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.0392607Z ) 2025-05-07T20:32:14.0392811Z else: 2025-05-07T20:32:14.0393035Z scale_ub_tensor = None 2025-05-07T20:32:14.0393301Z 2025-05-07T20:32:14.0393546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.0393856Z op = silu_mul_quant 2025-05-07T20:32:14.0394108Z if compiled: 2025-05-07T20:32:14.0394363Z op = torch.compile(op) 2025-05-07T20:32:14.0394657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0394939Z 2025-05-07T20:32:14.0395135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.0395304Z 2025-05-07T20:32:14.0395412Z moe/activation_test.py:117: 2025-05-07T20:32:14.0395706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0396043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.0396330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0396893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.0397583Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.0398244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.0398929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.0399454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.0400127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.0400866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.0401390Z kernel = self.compile( 2025-05-07T20:32:14.0401936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.0402594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.0403005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0403231Z 2025-05-07T20:32:14.0403440Z self = 2025-05-07T20:32:14.0404513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.0405885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561e40>} 2025-05-07T20:32:14.0407212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.0408220Z context = 2025-05-07T20:32:14.0408515Z 2025-05-07T20:32:14.0408689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.0409214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.0409687Z module_map=module_map) 2025-05-07T20:32:14.0410057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.0410422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.0410695Z E ^ 2025-05-07T20:32:14.0411162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.0411618Z 2025-05-07T20:32:14.0412029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.0412538Z 2025-05-07T20:32:14.2115133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2115792Z self=, 2025-05-07T20:32:14.2116417Z T=16384, 2025-05-07T20:32:14.2116686Z D=5120, 2025-05-07T20:32:14.2116962Z scale_ub=None, 2025-05-07T20:32:14.2117412Z contiguous=False, 2025-05-07T20:32:14.2117876Z compiled=True, 2025-05-07T20:32:14.2118292Z ) 2025-05-07T20:32:14.2118936Z self = 2025-05-07T20:32:14.2119916Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2120485Z 2025-05-07T20:32:14.2120647Z @given( 2025-05-07T20:32:14.2121118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2121733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2122349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2123003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2123648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124210Z ) 2025-05-07T20:32:14.2125212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2126096Z def test_silu_mul_quant( 2025-05-07T20:32:14.2126570Z self, 2025-05-07T20:32:14.2126960Z T: int, 2025-05-07T20:32:14.2127341Z D: int, 2025-05-07T20:32:14.2127581Z scale_ub: Optional[float], 2025-05-07T20:32:14.2127846Z contiguous: bool, 2025-05-07T20:32:14.2128085Z compiled: bool, 2025-05-07T20:32:14.2128316Z ) -> None: 2025-05-07T20:32:14.2128675Z torch.manual_seed(2025) 2025-05-07T20:32:14.2128920Z 2025-05-07T20:32:14.2129198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2129533Z 2025-05-07T20:32:14.2129729Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2130023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2130327Z x = x_sign * x_clamp 2025-05-07T20:32:14.2130574Z x0 = x[:, :D] 2025-05-07T20:32:14.2130793Z x1 = x[:, D:] 2025-05-07T20:32:14.2131003Z 2025-05-07T20:32:14.2131195Z if contiguous: 2025-05-07T20:32:14.2131431Z x0 = x0.contiguous() 2025-05-07T20:32:14.2131691Z x1 = x1.contiguous() 2025-05-07T20:32:14.2131924Z 2025-05-07T20:32:14.2132120Z if scale_ub is not None: 2025-05-07T20:32:14.2132395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2132720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2133103Z ) 2025-05-07T20:32:14.2133300Z else: 2025-05-07T20:32:14.2133507Z scale_ub_tensor = None 2025-05-07T20:32:14.2133762Z 2025-05-07T20:32:14.2133998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2134302Z op = silu_mul_quant 2025-05-07T20:32:14.2134553Z if compiled: 2025-05-07T20:32:14.2134802Z op = torch.compile(op) 2025-05-07T20:32:14.2135090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2135362Z 2025-05-07T20:32:14.2135561Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2135722Z 2025-05-07T20:32:14.2135826Z moe/activation_test.py:117: 2025-05-07T20:32:14.2136116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2136446Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2136723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2137274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2137837Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2138485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2139164Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2139686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2140362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2141017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2141537Z kernel = self.compile( 2025-05-07T20:32:14.2142075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2142717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2143116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2143340Z 2025-05-07T20:32:14.2143544Z self = 2025-05-07T20:32:14.2144607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2146037Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59321e40>} 2025-05-07T20:32:14.2147362Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2148383Z context = 2025-05-07T20:32:14.2148745Z 2025-05-07T20:32:14.2148910Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2149426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2149892Z module_map=module_map) 2025-05-07T20:32:14.2150248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2150607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2150879Z E ^ 2025-05-07T20:32:14.2151340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2151788Z 2025-05-07T20:32:14.2152199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2152705Z 2025-05-07T20:32:14.2152811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2153227Z self=, 2025-05-07T20:32:14.2153618Z T=2048, 2025-05-07T20:32:14.2153809Z D=5120, 2025-05-07T20:32:14.2154005Z scale_ub=None, 2025-05-07T20:32:14.2154218Z contiguous=False, 2025-05-07T20:32:14.2154446Z compiled=True, 2025-05-07T20:32:14.2154655Z ) 2025-05-07T20:32:14.5072339Z self = 2025-05-07T20:32:14.5073097Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.5073477Z 2025-05-07T20:32:14.5073608Z @given( 2025-05-07T20:32:14.5073938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5074384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5074709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5075047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5075390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5075696Z ) 2025-05-07T20:32:14.5076046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5076493Z def test_silu_mul_quant( 2025-05-07T20:32:14.5076747Z self, 2025-05-07T20:32:14.5076946Z T: int, 2025-05-07T20:32:14.5077163Z D: int, 2025-05-07T20:32:14.5077397Z scale_ub: Optional[float], 2025-05-07T20:32:14.5077684Z contiguous: bool, 2025-05-07T20:32:14.5077937Z compiled: bool, 2025-05-07T20:32:14.5078177Z ) -> None: 2025-05-07T20:32:14.5078415Z torch.manual_seed(2025) 2025-05-07T20:32:14.5078669Z 2025-05-07T20:32:14.5078952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5079300Z 2025-05-07T20:32:14.5087459Z x_sign = torch.sign(x) 2025-05-07T20:32:14.5087783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.5088105Z x = x_sign * x_clamp 2025-05-07T20:32:14.5088360Z x0 = x[:, :D] 2025-05-07T20:32:14.5088590Z x1 = x[:, D:] 2025-05-07T20:32:14.5088801Z 2025-05-07T20:32:14.5089000Z if contiguous: 2025-05-07T20:32:14.5089241Z x0 = x0.contiguous() 2025-05-07T20:32:14.5089508Z x1 = x1.contiguous() 2025-05-07T20:32:14.5089743Z 2025-05-07T20:32:14.5089941Z if scale_ub is not None: 2025-05-07T20:32:14.5090217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.5090553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.5090867Z ) 2025-05-07T20:32:14.5091234Z else: 2025-05-07T20:32:14.5091446Z scale_ub_tensor = None 2025-05-07T20:32:14.5091705Z 2025-05-07T20:32:14.5091943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.5092256Z op = silu_mul_quant 2025-05-07T20:32:14.5092519Z if compiled: 2025-05-07T20:32:14.5092770Z op = torch.compile(op) 2025-05-07T20:32:14.5093120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5093523Z 2025-05-07T20:32:14.5093727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.5093894Z 2025-05-07T20:32:14.5094002Z moe/activation_test.py:117: 2025-05-07T20:32:14.5094295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5094625Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.5094907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5095468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.5096029Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.5096683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5097362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5097899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5098580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5099240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5099763Z kernel = self.compile( 2025-05-07T20:32:14.5100303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5100955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5101359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5101587Z 2025-05-07T20:32:14.5101794Z self = 2025-05-07T20:32:14.5102862Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5104226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59320d60>} 2025-05-07T20:32:14.5105548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5106567Z context = 2025-05-07T20:32:14.5106862Z 2025-05-07T20:32:14.5107030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5107565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5108033Z module_map=module_map) 2025-05-07T20:32:14.5108396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5108761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5109027Z E ^ 2025-05-07T20:32:14.5109487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5109941Z 2025-05-07T20:32:14.5110354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5110871Z 2025-05-07T20:32:14.5110976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.5111492Z self=, 2025-05-07T20:32:14.5111886Z T=2048, 2025-05-07T20:32:14.5112086Z D=5120, 2025-05-07T20:32:14.5112289Z scale_ub=1200.0, 2025-05-07T20:32:14.5112511Z contiguous=False, 2025-05-07T20:32:14.5112747Z compiled=True, 2025-05-07T20:32:14.5112958Z ) 2025-05-07T20:32:14.5113272Z self = 2025-05-07T20:32:14.5113851Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.5114130Z 2025-05-07T20:32:14.5114210Z @given( 2025-05-07T20:32:14.5114449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5114754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5115065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5115401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5115752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5116033Z ) 2025-05-07T20:32:14.5116397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5116837Z def test_silu_mul_quant( 2025-05-07T20:32:14.5117077Z self, 2025-05-07T20:32:14.5117275Z T: int, 2025-05-07T20:32:14.5117489Z D: int, 2025-05-07T20:32:14.5117731Z scale_ub: Optional[float], 2025-05-07T20:32:14.5118047Z contiguous: bool, 2025-05-07T20:32:14.5118300Z compiled: bool, 2025-05-07T20:32:14.5118529Z ) -> None: 2025-05-07T20:32:14.5118744Z torch.manual_seed(2025) 2025-05-07T20:32:14.5118998Z 2025-05-07T20:32:14.5119279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5119621Z 2025-05-07T20:32:14.5119823Z x_sign = torch.sign(x) 2025-05-07T20:32:14.5120117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.5120422Z x = x_sign * x_clamp 2025-05-07T20:32:14.5120666Z x0 = x[:, :D] 2025-05-07T20:32:14.5120895Z x1 = x[:, D:] 2025-05-07T20:32:14.5121099Z 2025-05-07T20:32:14.5121291Z if contiguous: 2025-05-07T20:32:14.5121523Z x0 = x0.contiguous() 2025-05-07T20:32:14.5121772Z x1 = x1.contiguous() 2025-05-07T20:32:14.5122015Z 2025-05-07T20:32:14.5122210Z if scale_ub is not None: 2025-05-07T20:32:14.5122474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.5122816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.5123130Z ) 2025-05-07T20:32:14.5123327Z else: 2025-05-07T20:32:14.5123532Z scale_ub_tensor = None 2025-05-07T20:32:14.5123789Z 2025-05-07T20:32:14.5124021Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.5124334Z op = silu_mul_quant 2025-05-07T20:32:14.5124588Z if compiled: 2025-05-07T20:32:14.5124841Z op = torch.compile(op) 2025-05-07T20:32:14.5125130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5125415Z 2025-05-07T20:32:14.5125608Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.5125771Z 2025-05-07T20:32:14.5125870Z moe/activation_test.py:117: 2025-05-07T20:32:14.5126171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5126500Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.5126775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5127332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.5127933Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.5128586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5129260Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5129794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5130555Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5131210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5131723Z kernel = self.compile( 2025-05-07T20:32:14.5132262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5132985Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5133447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5133678Z 2025-05-07T20:32:14.5133883Z self = 2025-05-07T20:32:14.5134949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5136301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59031a80>} 2025-05-07T20:32:14.5137679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5138693Z context = 2025-05-07T20:32:14.5138982Z 2025-05-07T20:32:14.5139151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5139665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5140136Z module_map=module_map) 2025-05-07T20:32:14.5140498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5140861Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5141121Z E ^ 2025-05-07T20:32:14.5141573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5142025Z 2025-05-07T20:32:14.5142439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5142951Z 2025-05-07T20:32:14.6863678Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6864958Z self=, 2025-05-07T20:32:14.6865962Z T=4096, 2025-05-07T20:32:14.6866358Z D=5120, 2025-05-07T20:32:14.6866743Z scale_ub=1200.0, 2025-05-07T20:32:14.6867196Z contiguous=True, 2025-05-07T20:32:14.6867574Z compiled=True, 2025-05-07T20:32:14.6867787Z ) 2025-05-07T20:32:14.6868108Z self = 2025-05-07T20:32:14.6868620Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.6868891Z 2025-05-07T20:32:14.6868981Z @given( 2025-05-07T20:32:14.6869216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6869535Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6869934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6870386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6870752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6871042Z ) 2025-05-07T20:32:14.6871398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6871835Z def test_silu_mul_quant( 2025-05-07T20:32:14.6872083Z self, 2025-05-07T20:32:14.6872275Z T: int, 2025-05-07T20:32:14.6872472Z D: int, 2025-05-07T20:32:14.6872695Z scale_ub: Optional[float], 2025-05-07T20:32:14.6872969Z contiguous: bool, 2025-05-07T20:32:14.6873211Z compiled: bool, 2025-05-07T20:32:14.6873634Z ) -> None: 2025-05-07T20:32:14.6873860Z torch.manual_seed(2025) 2025-05-07T20:32:14.6874106Z 2025-05-07T20:32:14.6874384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6874728Z 2025-05-07T20:32:14.6874923Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6875217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6875646Z x = x_sign * x_clamp 2025-05-07T20:32:14.6875896Z x0 = x[:, :D] 2025-05-07T20:32:14.6876107Z x1 = x[:, D:] 2025-05-07T20:32:14.6876314Z 2025-05-07T20:32:14.6876500Z if contiguous: 2025-05-07T20:32:14.6876738Z x0 = x0.contiguous() 2025-05-07T20:32:14.6876996Z x1 = x1.contiguous() 2025-05-07T20:32:14.6877229Z 2025-05-07T20:32:14.6877417Z if scale_ub is not None: 2025-05-07T20:32:14.6877685Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6878020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6878323Z ) 2025-05-07T20:32:14.6878514Z else: 2025-05-07T20:32:14.6878725Z scale_ub_tensor = None 2025-05-07T20:32:14.6878977Z 2025-05-07T20:32:14.6879213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6879536Z op = silu_mul_quant 2025-05-07T20:32:14.6879782Z if compiled: 2025-05-07T20:32:14.6880042Z op = torch.compile(op) 2025-05-07T20:32:14.6880350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6880621Z 2025-05-07T20:32:14.6880829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6880995Z 2025-05-07T20:32:14.6881101Z moe/activation_test.py:117: 2025-05-07T20:32:14.6881402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6881736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6882021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6882586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6883146Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6883802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6884481Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6885011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6885690Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6886348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6886877Z kernel = self.compile( 2025-05-07T20:32:14.6887415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6888069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6888463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6888689Z 2025-05-07T20:32:14.6888901Z self = 2025-05-07T20:32:14.6889966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6891338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59033420>} 2025-05-07T20:32:14.6892660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6893869Z context = 2025-05-07T20:32:14.6894157Z 2025-05-07T20:32:14.6894333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6894848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6895317Z module_map=module_map) 2025-05-07T20:32:14.6895756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6896107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6896367Z E ^ 2025-05-07T20:32:14.6896831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6897272Z 2025-05-07T20:32:14.6897694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6898204Z 2025-05-07T20:32:14.6898320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6898733Z self=, 2025-05-07T20:32:14.6899143Z T=128, 2025-05-07T20:32:14.6899333Z D=5120, 2025-05-07T20:32:14.6899532Z scale_ub=1200.0, 2025-05-07T20:32:14.6899762Z contiguous=False, 2025-05-07T20:32:14.6899988Z compiled=True, 2025-05-07T20:32:14.6900196Z ) 2025-05-07T20:32:14.7907621Z self = 2025-05-07T20:32:14.7908846Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7909531Z 2025-05-07T20:32:14.7909755Z @given( 2025-05-07T20:32:14.7910324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7911013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7911568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7912162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7912770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7913294Z ) 2025-05-07T20:32:14.7913922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7914725Z def test_silu_mul_quant( 2025-05-07T20:32:14.7915167Z self, 2025-05-07T20:32:14.7915519Z T: int, 2025-05-07T20:32:14.7915883Z D: int, 2025-05-07T20:32:14.7916284Z scale_ub: Optional[float], 2025-05-07T20:32:14.7916783Z contiguous: bool, 2025-05-07T20:32:14.7917214Z compiled: bool, 2025-05-07T20:32:14.7917625Z ) -> None: 2025-05-07T20:32:14.7918009Z torch.manual_seed(2025) 2025-05-07T20:32:14.7918292Z 2025-05-07T20:32:14.7918571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7918915Z 2025-05-07T20:32:14.7919112Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7919411Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7919725Z x = x_sign * x_clamp 2025-05-07T20:32:14.7919969Z x0 = x[:, :D] 2025-05-07T20:32:14.7920190Z x1 = x[:, D:] 2025-05-07T20:32:14.7920402Z 2025-05-07T20:32:14.7920584Z if contiguous: 2025-05-07T20:32:14.7920821Z x0 = x0.contiguous() 2025-05-07T20:32:14.7921080Z x1 = x1.contiguous() 2025-05-07T20:32:14.7921314Z 2025-05-07T20:32:14.7921513Z if scale_ub is not None: 2025-05-07T20:32:14.7921790Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7922125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7922432Z ) 2025-05-07T20:32:14.7922633Z else: 2025-05-07T20:32:14.7922852Z scale_ub_tensor = None 2025-05-07T20:32:14.7923101Z 2025-05-07T20:32:14.7923340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7923661Z op = silu_mul_quant 2025-05-07T20:32:14.7923918Z if compiled: 2025-05-07T20:32:14.7924173Z op = torch.compile(op) 2025-05-07T20:32:14.7924689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7924966Z 2025-05-07T20:32:14.7925167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7925332Z 2025-05-07T20:32:14.7925448Z moe/activation_test.py:117: 2025-05-07T20:32:14.7925742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7926082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7926480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7927040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7927596Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7928260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7928942Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7929488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7930162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7930817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7931348Z kernel = self.compile( 2025-05-07T20:32:14.7931885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7932539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7932941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7933250Z 2025-05-07T20:32:14.7933463Z self = 2025-05-07T20:32:14.7934532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7935885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59032c00>} 2025-05-07T20:32:14.7937208Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7938227Z context = 2025-05-07T20:32:14.7938512Z 2025-05-07T20:32:14.7938678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7939196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7939662Z module_map=module_map) 2025-05-07T20:32:14.7940036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7940389Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7940662Z E ^ 2025-05-07T20:32:14.7941129Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:14.7942608Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.9194849Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:14.9227156Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:15.0998503Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:15.1031401Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:15.1973194Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
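For reference, the failure is reproducible without FBGEMM. This is a hypothetical repro sketch (kernel name and shapes invented for illustration, untested here); on a pre-8.9 GPU the cast to tl.float8e4nv is what the backend rejects at compile time with the same ValueError:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast fails to compile:
        # "type fp8e4nv not supported in this architecture."
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
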
2025-05-07T20:32:15.2632626Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.2633241Z     self=,
2025-05-07T20:32:15.2633864Z     T=16384,
2025-05-07T20:32:15.2634119Z     D=5120,
2025-05-07T20:32:15.2634402Z     scale_ub=None,
2025-05-07T20:32:15.2634704Z     contiguous=False,
2025-05-07T20:32:15.2634970Z     compiled=False,
2025-05-07T20:32:15.2635176Z )
2025-05-07T20:32:15.2635490Z self = 
2025-05-07T20:32:15.2635988Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

2025-05-07T20:32:15.2641406Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

2025-05-07T20:32:15.2641941Z         x_sign = torch.sign(x)
2025-05-07T20:32:15.2642230Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:15.2644226Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

2025-05-07T20:32:15.2646301Z moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:15.2646635Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError (tried to allocate 112.00 MiB) at moe/activation_test.py:95
2025-05-07T20:32:15.2659993Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError (tried to allocate 448.00 MiB) at moe/activation_test.py:92
2025-05-07T20:32:15.2672502Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError (tried to allocate 56.00 MiB) at moe/activation_test.py:95
2025-05-07T20:32:15.2685627Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError (tried to allocate 56.00 MiB) at moe/activation_test.py:94
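The OutOfMemoryError entries above look like a knock-on effect of the compile failures rather than an independent bug: each Hypothesis example materializes a [T, 2*D] bfloat16 input plus derived tensors (for T=16384, D=7168 a single tensor is 16384 * 14336 * 2 bytes = 448 MiB), and with roughly 21.6 GiB of the GPU's 22.07 GiB already allocated, even 56 MiB requests fail. A sketch of one mitigation, under the assumption that the harness can run cleanup between examples (the helper is illustrative, not from the repo); separately, the error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, would need to be exported before CUDA is initialized, e.g. in the workflow step that launches pytest:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return the allocator's
        # cached blocks so the next example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()
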
2025-05-07T20:32:15.3828420Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.3859652Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:15.4571008Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.4600205Z 2025-05-07T20:32:15.4600615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.4601115Z 2025-05-07T20:32:15.4601223Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.4601626Z self=, 2025-05-07T20:32:15.4602011Z T=2048, 2025-05-07T20:32:15.4602195Z D=7168, 2025-05-07T20:32:15.4602384Z scale_ub=1200.0, 2025-05-07T20:32:15.4602596Z contiguous=True, 2025-05-07T20:32:15.4602821Z compiled=False, 2025-05-07T20:32:15.4603023Z ) 2025-05-07T20:32:15.5405878Z self = 2025-05-07T20:32:15.5406601Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.5406989Z 2025-05-07T20:32:15.5407116Z @given( 2025-05-07T20:32:15.5407432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5407868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5408315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5408659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5408998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5409316Z ) 2025-05-07T20:32:15.5409675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5410121Z def test_silu_mul_quant( 2025-05-07T20:32:15.5410369Z self, 2025-05-07T20:32:15.5410581Z T: int, 2025-05-07T20:32:15.5410965Z D: int, 2025-05-07T20:32:15.5411183Z scale_ub: Optional[float], 2025-05-07T20:32:15.5411461Z contiguous: bool, 2025-05-07T20:32:15.5411701Z compiled: bool, 2025-05-07T20:32:15.5411933Z ) -> None: 2025-05-07T20:32:15.5412150Z torch.manual_seed(2025) 2025-05-07T20:32:15.5412389Z 2025-05-07T20:32:15.5412671Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5414910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
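Every CompilationError in this run is the same hardware mismatch: Triton's fp8e4nv type (torch.float8_e4m3fn) needs an sm_89-or-newer NVIDIA GPU, and the A10G behind a linux.g5.4xlarge runner is sm_86, which is why Triton offers only 'fp8e4b15' and 'fp8e5' here. A minimal capability guard, sketched as a hypothetical addition to the test module (fp8e4nv_supported is not a name from this log):

import unittest

import torch

def fp8e4nv_supported() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this runner reports sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test method:
# @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv unsupported on this GPU")

Guarding up front would collapse these per-example compile failures into a single clean skip on sm_86 runners.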
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
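The allocator hint in the full error message above (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only helps against fragmentation, and it must be in the environment before the process first initializes CUDA. A sketch of how it would be applied, assuming the test process can be configured at interpreter startup:

import os

# Read by the CUDA caching allocator when it is first initialized,
# so it must be set before any CUDA tensor exists in this process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the env var is in place

x = torch.randn(128, 2 * 7168, device="cuda", dtype=torch.bfloat16)

In this log, though, only 13.87 MiB is reserved-but-unallocated against 21.73 GiB of live allocations, so the failures point at memory accumulating across examples rather than fragmentation.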
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.74 GiB allocated by PyTorch, 5.24 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 4.62 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 2.12 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 2.12 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
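Most OutOfMemoryError examples above are secondary failures: once the first examples fill the 22.07 GiB card, roughly 21.7 GiB stays allocated and even 20.00 MiB requests fail. Hypothesis runs all examples inside a single setUp/tearDown cycle, so cleanup has to happen in the test body itself; a sketch of a hypothetical helper (free_cuda_between_examples is not part of activation_test.py):

import gc

import torch

def free_cuda_between_examples() -> None:
    # Drop references left over from the previous Hypothesis example,
    # then return cached CUDA blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()

# Hypothetical first statement of the test body, before torch.randn:
#     free_cuda_between_examples()

empty_cache() cannot reclaim tensors that are still referenced, but it keeps one failed example's cache from starving the next one.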
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.2415272Z 2025-05-07T20:32:16.2415391Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.2415602Z 2025-05-07T20:32:16.2500818Z FAILED 2025-05-07T20:32:16.2501124Z 2025-05-07T20:32:16.2501334Z =================================== FAILURES =================================== 2025-05-07T20:32:16.2501810Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:16.2502361Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:16.2503218Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:16.2503798Z | yield 2025-05-07T20:32:16.2504241Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:16.2504761Z | self._callTestMethod(testMethod) 2025-05-07T20:32:16.2505332Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:16.2505878Z | if method() is not None: 2025-05-07T20:32:16.2506127Z | ^^^^^^^^ 2025-05-07T20:32:16.2506757Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:16.2507836Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2508141Z | ^^^^^^^ 2025-05-07T20:32:16.2508700Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:16.2509306Z | raise the_error_hypothesis_found 2025-05-07T20:32:16.2509724Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:16.2510261Z +-+---------------- 1 ---------------- 2025-05-07T20:32:16.2510546Z | Traceback (most recent call last): 2025-05-07T20:32:16.2511232Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.2511997Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2512381Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.2514964Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
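Each sub-exception above ends with a @reproduce_failure blob. To replay, say, failure 1 locally, the decorator goes directly on the test while the strategies stay untouched. A minimal sketch (standalone function instead of the TestCase method, body elided; the blob is pinned to Hypothesis 6.131.14):

    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")  # blob printed for failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # original test body goes here unchanged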
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
(same fbgemm_gpu and triton frames as failure 4 above, from fp8_gemm.py:2370 down to compiler.py:273)
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
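Both kernels fail at the same point: Triton refuses to lower the fp8e4nv type (float8_e4m3fn) on this GPU. The g5 runner carries an NVIDIA A10G, an sm_86 part, and Triton's fp8e4nv conversions generally require compute capability 8.9 or newer, hence the error offering only fp8e4b15 and fp8e5. One way to keep such runners green is to gate the FP8 tests on device capability; a sketch under those assumptions, not code from this repo:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8_e4m3fn) lowering needs sm_89 or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 kernels need sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant and friends unchanged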
The shrink phase then retries seven more parameter combinations, and every one of them reaches this same CompilationError; the traces are identical to the one above except for the parameters and for which Triton kernel is being compiled:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant (moe/activation.py:80)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row (fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
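All of these failures happen inside Triton kernels, either the op under test (_fbgemm_silu_mul_quant) or the reference path (_kernel_quantize_fp8_row), so on hardware like this a Triton-free reference is one way to keep the comparison meaningful. A plain-PyTorch sketch of rowwise FP8 quantization follows, under the assumption that triton_quantize_fp8_row scales each row so its absolute max maps onto the e4m3 maximum, optionally capped by scale_ub; the real kernel may handle edge cases differently, and the helper name is hypothetical:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical stand-in for triton_quantize_fp8_row, shaped to match
        # how the test dequantizes: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)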
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)

T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

[test source identical to the first example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
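Every one of these failures is the same compile-time error, not a numerical mismatch: Triton's fp8e4nv is the e4m3 FP8 format, and on this runner's GPU the NVIDIA backend refuses to lower it, so both the FBGEMM kernel (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) die in ast_to_ttir before anything launches. For orientation, here is a rough PyTorch-only sketch of the computation under test, assuming triton_quantize_fp8_row performs standard rowwise max-abs scaling; rowwise_quant_sketch is illustrative, not FBGEMM's API, and this runs on CPU with no Triton involved:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3 ("fp8e4nv")

    def rowwise_quant_sketch(y, scale_ub=None):
        # One scale per row: max-abs, optionally clamped to scale_ub, over FP8_MAX.
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        y_scale = row_max / FP8_MAX
        return (y / y_scale[:, None]).to(torch.float8_e4m3fn), y_scale

    T, D = 128, 7168
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0, x1 = x[:, :D].float(), x[:, D:].float()
    y = x0 * torch.sigmoid(x0) * x1                       # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quant_sketch(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]   # dequantize, as the test does
    print((y - y_back).abs().max())                       # rowwise FP8 roundtrip error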
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

[test source and traceback identical: fails at moe/activation_test.py:117 in fn, compiling _fbgemm_silu_mul_quant]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
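The architecture is the deciding factor, which is why no combination of T, D, scale_ub, contiguity, or compilation mode changes the outcome: fp8e4nv lowering in Triton generally requires sm_89 or newer (Ada/Hopper), while the A10G in a linux.g5.4xlarge runner reports sm_86, where Triton offers only fp8e4b15 and fp8e5, exactly as the error message says. A quick probe; the (8, 9) threshold is an assumption to verify against the Triton release in use:

    import torch

    # fp8e4nv (e4m3) lowering generally needs sm_89+ on NVIDIA GPUs; the A10G
    # in a g5.4xlarge is sm_86, so this is expected to print False there.
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name(), f"sm_{major}{minor}",
          "fp8e4nv supported:", (major, minor) >= (8, 9))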
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

[all five fail identically at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
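Since every Hypothesis example hits the same wall, the suite could skip up front instead of burning through _MAX_SAMPLES identical CompilationErrors. A sketch of such a gate; the helper and class names are hypothetical, not what moe/activation_test.py actually declares:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Gate on compute capability rather than letting Triton fail in ast_to_ttir.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_has_fp8e4nv(), "fp8e4nv (e4m3) needs sm_89+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):  # illustrative name, not the real class
        ...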
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

[test source identical; this time the failure surfaces inside the torch.compile wrapper]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
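The eval_frame.py frame above shows the compiled variant simply re-entering the eager silu_mul_quant, so torch.compile offers no route around the unsupported dtype here. A minimal sketch that should reproduce the same ValueError on this GPU, assuming recent Triton's mapping of torch.float8_e4m3fn to fp8e4nv:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The .to(tl.float8e4nv) below is what trips ast_to_ttir on sm_86.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)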
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

[fails identically at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)

[fails identically at moe/activation_test.py:117 in fn, compiling _fbgemm_silu_mul_quant]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3058927Z 2025-05-07T20:32:16.3059613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3059622Z 2025-05-07T20:32:16.3059743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3059976Z self=, 2025-05-07T20:32:16.3060213Z T=128, 2025-05-07T20:32:16.3060292Z D=5120, 2025-05-07T20:32:16.3060384Z scale_ub=None, 2025-05-07T20:32:16.3060473Z contiguous=False, 2025-05-07T20:32:16.3060563Z compiled=True, 2025-05-07T20:32:16.3060636Z ) 2025-05-07T20:32:16.3060854Z self = 2025-05-07T20:32:16.3061026Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3061030Z 2025-05-07T20:32:16.3061109Z @given( 2025-05-07T20:32:16.3061231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3061336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3061449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3061565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3061684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3061757Z ) 2025-05-07T20:32:16.3062004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3062105Z def test_silu_mul_quant( 2025-05-07T20:32:16.3062182Z self, 2025-05-07T20:32:16.3062265Z T: int, 2025-05-07T20:32:16.3062342Z D: int, 2025-05-07T20:32:16.3062439Z scale_ub: Optional[float], 2025-05-07T20:32:16.3062535Z contiguous: bool, 2025-05-07T20:32:16.3062622Z compiled: bool, 2025-05-07T20:32:16.3062702Z ) -> None: 2025-05-07T20:32:16.3062803Z torch.manual_seed(2025) 2025-05-07T20:32:16.3062878Z 2025-05-07T20:32:16.3063048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3063128Z 2025-05-07T20:32:16.3063223Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3063356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3063447Z x = x_sign * x_clamp 2025-05-07T20:32:16.3063528Z x0 = x[:, :D] 2025-05-07T20:32:16.3063615Z x1 = x[:, D:] 2025-05-07T20:32:16.3063690Z 2025-05-07T20:32:16.3063780Z if contiguous: 2025-05-07T20:32:16.3063876Z x0 = x0.contiguous() 2025-05-07T20:32:16.3063965Z x1 = x1.contiguous() 2025-05-07T20:32:16.3064040Z 2025-05-07T20:32:16.3064135Z if scale_ub is not None: 2025-05-07T20:32:16.3064239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3064374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3064458Z ) 2025-05-07T20:32:16.3064536Z else: 2025-05-07T20:32:16.3064635Z scale_ub_tensor = None 2025-05-07T20:32:16.3064713Z 2025-05-07T20:32:16.3064842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3064937Z op = silu_mul_quant 2025-05-07T20:32:16.3065024Z if compiled: 2025-05-07T20:32:16.3065122Z op = torch.compile(op) 2025-05-07T20:32:16.3065233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3065309Z 2025-05-07T20:32:16.3065400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3065408Z 2025-05-07T20:32:16.3065512Z moe/activation_test.py:117: 2025-05-07T20:32:16.3065639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3065745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3065845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3066207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3066306Z return fn(*args, **kwargs) 
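
Every example in this run dies in the same place: Triton refuses to lower the fp8e4nv encoding (torch.float8_e4m3fn) because it is only implemented for NVIDIA compute capability 8.9 and newer (Ada/Hopper); on older GPUs Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard that would skip the whole parameter grid up front is sketched below; the helper name and skipIf placement are illustrative assumptions, not the suite's actual gating logic.

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (E4M3) only on compute capability >= (8, 9);
    # pre-Ada GPUs are limited to the fp8e4b15 and fp8e5 encodings.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical decorator for tests that compile fp8e4nv kernels.
fp8e4nv_only = unittest.skipIf(
    not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)"
)

Applied as @fp8e4nv_only on test_silu_mul_quant, the run would record one clean skip on this machine instead of failing example by example.
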
2025-05-07T20:32:16.3066925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3067029Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3067385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3067605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3068017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3068110Z kernel = self.compile( 2025-05-07T20:32:16.3068484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3068659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3068784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3068794Z 2025-05-07T20:32:16.3068996Z self = 2025-05-07T20:32:16.3069763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3070259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7df80>} 2025-05-07T20:32:16.3071001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3071192Z context = 2025-05-07T20:32:16.3071196Z 2025-05-07T20:32:16.3071367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3071620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3071726Z module_map=module_map) 2025-05-07T20:32:16.3071891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3071990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3072070Z E ^ 2025-05-07T20:32:16.3072429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3072433Z 2025-05-07T20:32:16.3072839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3072843Z 2025-05-07T20:32:16.3072949Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3073170Z self=, 2025-05-07T20:32:16.3073250Z T=128, 2025-05-07T20:32:16.3073337Z D=7168, 2025-05-07T20:32:16.3073422Z scale_ub=1200.0, 2025-05-07T20:32:16.3073509Z contiguous=False, 2025-05-07T20:32:16.3073600Z compiled=False, 2025-05-07T20:32:16.3073676Z ) 2025-05-07T20:32:16.3073895Z self = 2025-05-07T20:32:16.3074067Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3074077Z 2025-05-07T20:32:16.3074152Z @given( 2025-05-07T20:32:16.3074274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3074374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3074488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3074611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3074728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3074804Z ) 2025-05-07T20:32:16.3075052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3075252Z def test_silu_mul_quant( 2025-05-07T20:32:16.3075335Z self, 2025-05-07T20:32:16.3075412Z T: int, 2025-05-07T20:32:16.3075488Z D: int, 2025-05-07T20:32:16.3075589Z scale_ub: Optional[float], 2025-05-07T20:32:16.3075681Z contiguous: bool, 2025-05-07T20:32:16.3075766Z compiled: bool, 2025-05-07T20:32:16.3075846Z ) -> None: 2025-05-07T20:32:16.3076019Z torch.manual_seed(2025) 2025-05-07T20:32:16.3076093Z 2025-05-07T20:32:16.3076271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3076342Z 2025-05-07T20:32:16.3076431Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3076560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3076650Z x = x_sign * x_clamp 2025-05-07T20:32:16.3076740Z x0 = x[:, :D] 2025-05-07T20:32:16.3076821Z x1 = x[:, D:] 2025-05-07T20:32:16.3076893Z 2025-05-07T20:32:16.3076985Z if contiguous: 2025-05-07T20:32:16.3077076Z x0 = x0.contiguous() 2025-05-07T20:32:16.3077165Z x1 = x1.contiguous() 2025-05-07T20:32:16.3077244Z 2025-05-07T20:32:16.3077334Z if scale_ub is not None: 2025-05-07T20:32:16.3077437Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3077576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3077652Z ) 2025-05-07T20:32:16.3077730Z else: 2025-05-07T20:32:16.3077830Z scale_ub_tensor = None 2025-05-07T20:32:16.3077904Z 2025-05-07T20:32:16.3078043Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3078143Z op = silu_mul_quant 2025-05-07T20:32:16.3078238Z if compiled: 2025-05-07T20:32:16.3078357Z op = torch.compile(op) 2025-05-07T20:32:16.3078468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3078541Z 2025-05-07T20:32:16.3078639Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3078648Z 2025-05-07T20:32:16.3078744Z moe/activation_test.py:117: 2025-05-07T20:32:16.3078871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3078976Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3079075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3079571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3079673Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3080026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3080249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3080583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3080677Z kernel = self.compile( 2025-05-07T20:32:16.3081062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3081236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3081372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3081376Z 2025-05-07T20:32:16.3081579Z self = 2025-05-07T20:32:16.3082342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3082846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43450360>} 2025-05-07T20:32:16.3083657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3083855Z context = 2025-05-07T20:32:16.3083859Z 2025-05-07T20:32:16.3084022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3084278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3084509Z module_map=module_map) 2025-05-07T20:32:16.3084669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3084772Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3084847Z E ^ 2025-05-07T20:32:16.3085199Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3085204Z 2025-05-07T20:32:16.3085617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3085622Z 2025-05-07T20:32:16.3085724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3085950Z self=, 2025-05-07T20:32:16.3086024Z T=128, 2025-05-07T20:32:16.3086099Z D=5120, 2025-05-07T20:32:16.3086183Z scale_ub=None, 2025-05-07T20:32:16.3086277Z contiguous=False, 2025-05-07T20:32:16.3086358Z compiled=False, 2025-05-07T20:32:16.3086433Z ) 2025-05-07T20:32:16.3086648Z self = 2025-05-07T20:32:16.3086815Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3086828Z 2025-05-07T20:32:16.3086904Z @given( 2025-05-07T20:32:16.3087022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3087129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3087251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3087365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3087483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3087557Z ) 2025-05-07T20:32:16.3087797Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3087895Z def test_silu_mul_quant( 2025-05-07T20:32:16.3087972Z self, 2025-05-07T20:32:16.3088056Z T: int, 2025-05-07T20:32:16.3088139Z D: int, 2025-05-07T20:32:16.3088237Z scale_ub: Optional[float], 2025-05-07T20:32:16.3088332Z contiguous: bool, 2025-05-07T20:32:16.3088417Z compiled: bool, 2025-05-07T20:32:16.3088493Z ) -> None: 2025-05-07T20:32:16.3088594Z torch.manual_seed(2025) 2025-05-07T20:32:16.3088666Z 2025-05-07T20:32:16.3088833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3088909Z 2025-05-07T20:32:16.3089006Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3089130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3089227Z x = x_sign * x_clamp 2025-05-07T20:32:16.3089307Z x0 = x[:, :D] 2025-05-07T20:32:16.3089387Z x1 = x[:, D:] 2025-05-07T20:32:16.3089463Z 2025-05-07T20:32:16.3089546Z if contiguous: 2025-05-07T20:32:16.3089637Z x0 = x0.contiguous() 2025-05-07T20:32:16.3089734Z x1 = x1.contiguous() 2025-05-07T20:32:16.3089802Z 2025-05-07T20:32:16.3089895Z if scale_ub is not None: 2025-05-07T20:32:16.3090003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3090139Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3090222Z ) 2025-05-07T20:32:16.3090297Z else: 2025-05-07T20:32:16.3090392Z scale_ub_tensor = None 2025-05-07T20:32:16.3090465Z 2025-05-07T20:32:16.3090594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3090766Z op = silu_mul_quant 2025-05-07T20:32:16.3090855Z if compiled: 2025-05-07T20:32:16.3090954Z op = torch.compile(op) 2025-05-07T20:32:16.3091057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3091135Z 2025-05-07T20:32:16.3091227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3091231Z 2025-05-07T20:32:16.3091327Z moe/activation_test.py:117: 2025-05-07T20:32:16.3091534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3091635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3091736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3092225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3092323Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3092685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3092904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3093293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3093386Z kernel = self.compile( 2025-05-07T20:32:16.3093763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3093945Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3094075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3094079Z 2025-05-07T20:32:16.3094291Z self = 2025-05-07T20:32:16.3095054Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3095550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a434514e0>} 2025-05-07T20:32:16.3096284Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3096478Z context = 2025-05-07T20:32:16.3096483Z 2025-05-07T20:32:16.3096650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3096905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3097012Z module_map=module_map) 2025-05-07T20:32:16.3097176Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3097280Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3097364Z E ^ 2025-05-07T20:32:16.3097710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3097714Z 2025-05-07T20:32:16.3098116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3098125Z 2025-05-07T20:32:16.3098230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3098447Z self=, 2025-05-07T20:32:16.3098527Z T=128, 2025-05-07T20:32:16.3098607Z D=5120, 2025-05-07T20:32:16.3098689Z scale_ub=1200.0, 2025-05-07T20:32:16.3098778Z contiguous=True, 2025-05-07T20:32:16.3098861Z compiled=False, 2025-05-07T20:32:16.3098934Z ) 2025-05-07T20:32:16.3099154Z self = 2025-05-07T20:32:16.3099403Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3099409Z 2025-05-07T20:32:16.3099483Z @given( 2025-05-07T20:32:16.3099608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3099705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3099823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3099945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3100159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3100238Z ) 2025-05-07T20:32:16.3100482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3100572Z def test_silu_mul_quant( 2025-05-07T20:32:16.3100653Z self, 2025-05-07T20:32:16.3100730Z T: int, 2025-05-07T20:32:16.3100804Z D: int, 2025-05-07T20:32:16.3100907Z scale_ub: Optional[float], 2025-05-07T20:32:16.3100996Z contiguous: bool, 2025-05-07T20:32:16.3101086Z compiled: bool, 2025-05-07T20:32:16.3101166Z ) -> None: 2025-05-07T20:32:16.3101261Z torch.manual_seed(2025) 2025-05-07T20:32:16.3101333Z 2025-05-07T20:32:16.3101566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3105802Z 2025-05-07T20:32:16.3105904Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3106031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3106133Z x = x_sign * x_clamp 2025-05-07T20:32:16.3106211Z x0 = x[:, :D] 2025-05-07T20:32:16.3106295Z x1 = x[:, D:] 2025-05-07T20:32:16.3106366Z 2025-05-07T20:32:16.3106446Z if contiguous: 2025-05-07T20:32:16.3106541Z x0 = x0.contiguous() 2025-05-07T20:32:16.3106628Z x1 = x1.contiguous() 2025-05-07T20:32:16.3106700Z 2025-05-07T20:32:16.3106797Z if scale_ub is not None: 2025-05-07T20:32:16.3106901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3107039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3107117Z ) 2025-05-07T20:32:16.3107193Z else: 2025-05-07T20:32:16.3107283Z scale_ub_tensor = None 2025-05-07T20:32:16.3107360Z 2025-05-07T20:32:16.3107493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3107586Z op = silu_mul_quant 2025-05-07T20:32:16.3107669Z if compiled: 2025-05-07T20:32:16.3107770Z op = torch.compile(op) 2025-05-07T20:32:16.3107879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3107949Z 2025-05-07T20:32:16.3108040Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3108047Z 2025-05-07T20:32:16.3108167Z moe/activation_test.py:117: 2025-05-07T20:32:16.3108315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3108415Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3108516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3109016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3109118Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3109474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3109692Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3110037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3110131Z kernel = self.compile( 2025-05-07T20:32:16.3110506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3110685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3110809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3110917Z 2025-05-07T20:32:16.3111126Z self = 2025-05-07T20:32:16.3111885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3112385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43453560>} 2025-05-07T20:32:16.3113190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3113377Z context = 2025-05-07T20:32:16.3113381Z 2025-05-07T20:32:16.3113551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3113810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3113918Z module_map=module_map) 2025-05-07T20:32:16.3114076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3114175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3114261Z E ^ 2025-05-07T20:32:16.3114613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3114618Z 2025-05-07T20:32:16.3115023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3115031Z 2025-05-07T20:32:16.3115130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3115345Z self=, 2025-05-07T20:32:16.3115424Z T=1, 2025-05-07T20:32:16.3115500Z D=7168, 2025-05-07T20:32:16.3115583Z scale_ub=1200.0, 2025-05-07T20:32:16.3115674Z contiguous=True, 2025-05-07T20:32:16.3115756Z compiled=True, 2025-05-07T20:32:16.3115832Z ) 2025-05-07T20:32:16.3116048Z self = 2025-05-07T20:32:16.3116207Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3116217Z 2025-05-07T20:32:16.3116295Z @given( 2025-05-07T20:32:16.3116419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3116515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3116635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3116749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3116860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3116938Z ) 2025-05-07T20:32:16.3117182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3117271Z def test_silu_mul_quant( 2025-05-07T20:32:16.3117343Z self, 2025-05-07T20:32:16.3117417Z T: int, 2025-05-07T20:32:16.3117492Z D: int, 2025-05-07T20:32:16.3117591Z scale_ub: Optional[float], 2025-05-07T20:32:16.3117678Z contiguous: bool, 2025-05-07T20:32:16.3117766Z compiled: bool, 2025-05-07T20:32:16.3117844Z ) -> None: 2025-05-07T20:32:16.3117940Z torch.manual_seed(2025) 2025-05-07T20:32:16.3118021Z 2025-05-07T20:32:16.3118209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3118283Z 2025-05-07T20:32:16.3118397Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3118524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3118611Z x = x_sign * x_clamp 2025-05-07T20:32:16.3118691Z x0 = x[:, :D] 2025-05-07T20:32:16.3118768Z x1 = x[:, D:] 2025-05-07T20:32:16.3118840Z 2025-05-07T20:32:16.3119007Z if contiguous: 2025-05-07T20:32:16.3119100Z x0 = x0.contiguous() 2025-05-07T20:32:16.3119187Z x1 = x1.contiguous() 2025-05-07T20:32:16.3119258Z 2025-05-07T20:32:16.3119345Z if scale_ub is not None: 2025-05-07T20:32:16.3119449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3119580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3119652Z ) 2025-05-07T20:32:16.3119807Z else: 2025-05-07T20:32:16.3119899Z scale_ub_tensor = None 2025-05-07T20:32:16.3119970Z 2025-05-07T20:32:16.3120102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3120190Z op = silu_mul_quant 2025-05-07T20:32:16.3120269Z if compiled: 2025-05-07T20:32:16.3120370Z op = torch.compile(op) 2025-05-07T20:32:16.3120470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3120540Z 2025-05-07T20:32:16.3120632Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3120642Z 2025-05-07T20:32:16.3120735Z moe/activation_test.py:117: 2025-05-07T20:32:16.3120866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3120963Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3121058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3121423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3121516Z return fn(*args, **kwargs) 
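
The reference path's triton_quantize_fp8_row hits the identical error, since it also emits fp8e4nv. Its row-wise contract can be emulated in eager float32, which is useful for checking values on hardware without fp8 support. A minimal sketch, assuming the returned scale is the dequantization scale (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row maximum; FP8_MAX is float8_e4m3fn's largest finite value:

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # largest finite float8_e4m3fn value

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1)  # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed upper bound
    scale = (row_max / FP8_MAX).clamp(min=1e-12)  # dequantization scale
    y_q = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_q.to(torch.float8_e4m3fn), scale

The test's input clamp to magnitudes in [0.01, 2.0] keeps row_max strictly positive, so the per-row scale never degenerates.
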
2025-05-07T20:32:16.3122000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3122099Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3122445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3122665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3122998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3123088Z kernel = self.compile( 2025-05-07T20:32:16.3123461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3123628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3123759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3123764Z 2025-05-07T20:32:16.3123963Z self = 2025-05-07T20:32:16.3124719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3125219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a42ce42c0>} 2025-05-07T20:32:16.3125953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3126140Z context = 2025-05-07T20:32:16.3126149Z 2025-05-07T20:32:16.3126308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3126562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3126668Z module_map=module_map) 2025-05-07T20:32:16.3126830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3126928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3127004Z E ^ 2025-05-07T20:32:16.3127431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3127436Z 2025-05-07T20:32:16.3127844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3127849Z 2025-05-07T20:32:16.3127946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3128171Z self=, 2025-05-07T20:32:16.3128342Z T=1, 2025-05-07T20:32:16.3128420Z D=7168, 2025-05-07T20:32:16.3128520Z scale_ub=1200.0, 2025-05-07T20:32:16.3128607Z contiguous=False, 2025-05-07T20:32:16.3128688Z compiled=True, 2025-05-07T20:32:16.3128761Z ) 2025-05-07T20:32:16.3128971Z self = 2025-05-07T20:32:16.3129131Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3129135Z 2025-05-07T20:32:16.3129215Z @given( 2025-05-07T20:32:16.3129331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3129433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3129543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3129656Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3129766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3129837Z ) 2025-05-07T20:32:16.3130082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3130175Z def test_silu_mul_quant( 2025-05-07T20:32:16.3130246Z self, 2025-05-07T20:32:16.3130322Z T: int, 2025-05-07T20:32:16.3130400Z D: int, 2025-05-07T20:32:16.3130495Z scale_ub: Optional[float], 2025-05-07T20:32:16.3130579Z contiguous: bool, 2025-05-07T20:32:16.3130662Z compiled: bool, 2025-05-07T20:32:16.3130738Z ) -> None: 2025-05-07T20:32:16.3130834Z torch.manual_seed(2025) 2025-05-07T20:32:16.3130910Z 2025-05-07T20:32:16.3131074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3131149Z 2025-05-07T20:32:16.3131236Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3131356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3131445Z x = x_sign * x_clamp 2025-05-07T20:32:16.3131522Z x0 = x[:, :D] 2025-05-07T20:32:16.3131604Z x1 = x[:, D:] 2025-05-07T20:32:16.3131678Z 2025-05-07T20:32:16.3131757Z if contiguous: 2025-05-07T20:32:16.3131848Z x0 = x0.contiguous() 2025-05-07T20:32:16.3131938Z x1 = x1.contiguous() 2025-05-07T20:32:16.3132008Z 2025-05-07T20:32:16.3132093Z if scale_ub is not None: 2025-05-07T20:32:16.3132197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3132327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3132410Z ) 2025-05-07T20:32:16.3132483Z else: 2025-05-07T20:32:16.3132579Z scale_ub_tensor = None 2025-05-07T20:32:16.3132650Z 2025-05-07T20:32:16.3132776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3132863Z op = silu_mul_quant 2025-05-07T20:32:16.3132949Z if compiled: 2025-05-07T20:32:16.3133144Z op = torch.compile(op) 2025-05-07T20:32:16.3133247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3133323Z 2025-05-07T20:32:16.3133409Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3133414Z 2025-05-07T20:32:16.3133510Z moe/activation_test.py:117: 2025-05-07T20:32:16.3133637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3133733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3133830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3134191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3134388Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3134876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3134971Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3135322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3135618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3135947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3136044Z kernel = self.compile( 2025-05-07T20:32:16.3136417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3136588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3136720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3136725Z 2025-05-07T20:32:16.3136924Z self = 2025-05-07T20:32:16.3137682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3138182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a42ce5120>} 2025-05-07T20:32:16.3138910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3139095Z context = 2025-05-07T20:32:16.3139103Z 2025-05-07T20:32:16.3139263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3139520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3139625Z module_map=module_map) 2025-05-07T20:32:16.3139779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3139879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3139955Z E ^ 2025-05-07T20:32:16.3140303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3140307Z 2025-05-07T20:32:16.3140709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3140714Z 2025-05-07T20:32:16.3140812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3141034Z self=, 2025-05-07T20:32:16.3141110Z T=1, 2025-05-07T20:32:16.3141187Z D=7168, 2025-05-07T20:32:16.3141267Z scale_ub=None, 2025-05-07T20:32:16.3141350Z contiguous=False, 2025-05-07T20:32:16.3141432Z compiled=True, 2025-05-07T20:32:16.3141502Z ) 2025-05-07T20:32:16.3141714Z self = 2025-05-07T20:32:16.3141878Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3141887Z 2025-05-07T20:32:16.3141960Z @given( 2025-05-07T20:32:16.3142075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3142173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3142283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3142404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3142514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3142585Z ) 2025-05-07T20:32:16.3142908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3143005Z def test_silu_mul_quant( 2025-05-07T20:32:16.3143081Z self, 2025-05-07T20:32:16.3143159Z T: int, 2025-05-07T20:32:16.3143236Z D: int, 2025-05-07T20:32:16.3143336Z scale_ub: Optional[float], 2025-05-07T20:32:16.3143431Z contiguous: bool, 2025-05-07T20:32:16.3143518Z compiled: bool, 2025-05-07T20:32:16.3143669Z ) -> None: 2025-05-07T20:32:16.3143769Z torch.manual_seed(2025) 2025-05-07T20:32:16.3143839Z 2025-05-07T20:32:16.3144021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3144098Z 2025-05-07T20:32:16.3144191Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3144326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3144415Z x = x_sign * x_clamp 2025-05-07T20:32:16.3144494Z x0 = x[:, :D] 2025-05-07T20:32:16.3144578Z x1 = x[:, D:] 2025-05-07T20:32:16.3144655Z 2025-05-07T20:32:16.3144739Z if contiguous: 2025-05-07T20:32:16.3144837Z x0 = x0.contiguous() 2025-05-07T20:32:16.3144927Z x1 = x1.contiguous() 2025-05-07T20:32:16.3145001Z 2025-05-07T20:32:16.3145097Z if scale_ub is not None: 2025-05-07T20:32:16.3145205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3145347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3145430Z ) 2025-05-07T20:32:16.3145506Z else: 2025-05-07T20:32:16.3145603Z scale_ub_tensor = None 2025-05-07T20:32:16.3145675Z 2025-05-07T20:32:16.3145808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3145900Z op = silu_mul_quant 2025-05-07T20:32:16.3145983Z if compiled: 2025-05-07T20:32:16.3146085Z op = torch.compile(op) 2025-05-07T20:32:16.3146196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3146266Z 2025-05-07T20:32:16.3146365Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.3146498Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.3146568Z 2025-05-07T20:32:16.3146711Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3146819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.3146921Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.3147050Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.3147202Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.3147274Z 2025-05-07T20:32:16.3147378Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.3147382Z 2025-05-07T20:32:16.3147486Z moe/activation_test.py:126: 2025-05-07T20:32:16.3147622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3147734Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.3147879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.3148611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.3148714Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.3149136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3149390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3149825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.3150121Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.3150568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.3150750Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.3151292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.3151369Z fn() 2025-05-07T20:32:16.3151759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.3151841Z self.fn.run( 2025-05-07T20:32:16.3152170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3152338Z kernel = self.compile( 2025-05-07T20:32:16.3152707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3152875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3153002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3153006Z 2025-05-07T20:32:16.3153209Z self = 2025-05-07T20:32:16.3153965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3154462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a438d7420>} 2025-05-07T20:32:16.3155193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3155380Z context = 2025-05-07T20:32:16.3155385Z 2025-05-07T20:32:16.3155547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3155808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3155912Z module_map=module_map) 2025-05-07T20:32:16.3156070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3156174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.3156250Z E ^ 2025-05-07T20:32:16.3156593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3156601Z 2025-05-07T20:32:16.3157009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3157014Z 2025-05-07T20:32:16.3157113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3157332Z self=, 2025-05-07T20:32:16.3157404Z T=1, 2025-05-07T20:32:16.3157478Z D=5120, 2025-05-07T20:32:16.3157562Z scale_ub=1200.0, 2025-05-07T20:32:16.3157649Z contiguous=False, 2025-05-07T20:32:16.3157730Z compiled=True, 2025-05-07T20:32:16.3157803Z ) 2025-05-07T20:32:16.3158016Z self = 2025-05-07T20:32:16.3158183Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3158187Z 2025-05-07T20:32:16.3158259Z @given( 2025-05-07T20:32:16.3158375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3158477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3158590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3158704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3158816Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3158888Z ) 2025-05-07T20:32:16.3159126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3159448Z def test_silu_mul_quant( 2025-05-07T20:32:16.3159723Z self, 2025-05-07T20:32:16.3159809Z T: int, 2025-05-07T20:32:16.3159886Z D: int, 2025-05-07T20:32:16.3159981Z scale_ub: Optional[float], 2025-05-07T20:32:16.3160067Z contiguous: bool, 2025-05-07T20:32:16.3160148Z compiled: bool, 2025-05-07T20:32:16.3160220Z ) -> None: 2025-05-07T20:32:16.3160313Z torch.manual_seed(2025) 2025-05-07T20:32:16.3160383Z 2025-05-07T20:32:16.3160664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3160741Z 2025-05-07T20:32:16.3160829Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3160954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3161045Z x = x_sign * x_clamp 2025-05-07T20:32:16.3161120Z x0 = x[:, :D] 2025-05-07T20:32:16.3161197Z x1 = x[:, D:] 2025-05-07T20:32:16.3161270Z 2025-05-07T20:32:16.3161350Z if contiguous: 2025-05-07T20:32:16.3161441Z x0 = x0.contiguous() 2025-05-07T20:32:16.3161533Z x1 = x1.contiguous() 2025-05-07T20:32:16.3161607Z 2025-05-07T20:32:16.3161699Z if scale_ub is not None: 2025-05-07T20:32:16.3161799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3161930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3162005Z ) 2025-05-07T20:32:16.3162078Z else: 2025-05-07T20:32:16.3162169Z scale_ub_tensor = None 2025-05-07T20:32:16.3162245Z 2025-05-07T20:32:16.3162370Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3162456Z op = silu_mul_quant 2025-05-07T20:32:16.3162540Z if compiled: 2025-05-07T20:32:16.3162635Z op = torch.compile(op) 2025-05-07T20:32:16.3162737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3162805Z 2025-05-07T20:32:16.3162894Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3162899Z 2025-05-07T20:32:16.3163002Z moe/activation_test.py:117: 2025-05-07T20:32:16.3163136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3163237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3163335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3163698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3163787Z return fn(*args, **kwargs) 
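
Spelled out end to end, the semantics under test are a SiLU gate followed by that row-wise quantization; a two-line eager equivalent of silu_mul_quant, reusing the quantize_fp8_row_ref sketch above:

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in float32, matching the test's ref_fn, then quantize per row.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)
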
2025-05-07T20:32:16.3164275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3164372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3164718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3164933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3165273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3165364Z kernel = self.compile( 2025-05-07T20:32:16.3165740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3165910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3166031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3166041Z 2025-05-07T20:32:16.3166244Z self = 2025-05-07T20:32:16.3166997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3167600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7fa60>} 2025-05-07T20:32:16.3168329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3168514Z context = 2025-05-07T20:32:16.3168522Z 2025-05-07T20:32:16.3168683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3169011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3169123Z module_map=module_map) 2025-05-07T20:32:16.3169284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3169380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3169461Z E ^ 2025-05-07T20:32:16.3169809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3169814Z 2025-05-07T20:32:16.3170219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3170224Z 2025-05-07T20:32:16.3170321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3170540Z self=, 2025-05-07T20:32:16.3170614Z T=1, 2025-05-07T20:32:16.3170693Z D=5120, 2025-05-07T20:32:16.3170772Z scale_ub=1200.0, 2025-05-07T20:32:16.3170858Z contiguous=False, 2025-05-07T20:32:16.3170942Z compiled=False, 2025-05-07T20:32:16.3171009Z ) 2025-05-07T20:32:16.3171222Z self = 2025-05-07T20:32:16.3171384Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3171389Z 2025-05-07T20:32:16.3171462Z @given( 2025-05-07T20:32:16.3171576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3171675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3171791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3171904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3172013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3172086Z ) 2025-05-07T20:32:16.3172325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3172422Z def test_silu_mul_quant( 2025-05-07T20:32:16.3172500Z self, 2025-05-07T20:32:16.3172572Z T: int, 2025-05-07T20:32:16.3172647Z D: int, 2025-05-07T20:32:16.3172742Z scale_ub: Optional[float], 2025-05-07T20:32:16.3172828Z contiguous: bool, 2025-05-07T20:32:16.3172911Z compiled: bool, 2025-05-07T20:32:16.3172985Z ) -> None: 2025-05-07T20:32:16.3173127Z torch.manual_seed(2025) 2025-05-07T20:32:16.3173203Z 2025-05-07T20:32:16.3173368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3173439Z 2025-05-07T20:32:16.3173530Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3173653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3173737Z x = x_sign * x_clamp 2025-05-07T20:32:16.3173818Z x0 = x[:, :D] 2025-05-07T20:32:16.3173896Z x1 = x[:, D:] 2025-05-07T20:32:16.3173971Z 2025-05-07T20:32:16.3174050Z if contiguous: 2025-05-07T20:32:16.3174143Z x0 = x0.contiguous() 2025-05-07T20:32:16.3174231Z x1 = x1.contiguous() 2025-05-07T20:32:16.3174299Z 2025-05-07T20:32:16.3174387Z if scale_ub is not None: 2025-05-07T20:32:16.3174490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3174621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3174693Z ) 2025-05-07T20:32:16.3174765Z else: 2025-05-07T20:32:16.3174855Z scale_ub_tensor = None 2025-05-07T20:32:16.3174925Z 2025-05-07T20:32:16.3175137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3175226Z op = silu_mul_quant 2025-05-07T20:32:16.3175306Z if compiled: 2025-05-07T20:32:16.3175407Z op = torch.compile(op) 2025-05-07T20:32:16.3175509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3175582Z 2025-05-07T20:32:16.3175669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3175749Z 2025-05-07T20:32:16.3175844Z moe/activation_test.py:117: 2025-05-07T20:32:16.3175971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3176067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3176160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3176652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3176747Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3177104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3177322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3177653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3177747Z kernel = self.compile( 2025-05-07T20:32:16.3178125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3178318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3178467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3178473Z 2025-05-07T20:32:16.3178671Z self = 2025-05-07T20:32:16.3179435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3179927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43e0f6a0>} 2025-05-07T20:32:16.3180657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3180847Z context = 2025-05-07T20:32:16.3180852Z 2025-05-07T20:32:16.3181010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3181268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3181377Z module_map=module_map) 2025-05-07T20:32:16.3181537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3181632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3181705Z E ^ 2025-05-07T20:32:16.3182054Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3332709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3332926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3333304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3333397Z kernel = self.compile( 2025-05-07T20:32:16.3333768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3333935Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3334058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3334066Z 2025-05-07T20:32:16.3334264Z self = 2025-05-07T20:32:16.3335021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3335515Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a703bbc40>} 2025-05-07T20:32:16.3336243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3336429Z context = 2025-05-07T20:32:16.3336434Z 2025-05-07T20:32:16.3336684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3336945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3337045Z module_map=module_map) 2025-05-07T20:32:16.3337201Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3337301Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3337373Z E ^ 2025-05-07T20:32:16.3337796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3337801Z 2025-05-07T20:32:16.3338208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3338214Z 2025-05-07T20:32:16.3338334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3338577Z self=, 2025-05-07T20:32:16.3338650Z T=2048, 2025-05-07T20:32:16.3338726Z D=7168, 2025-05-07T20:32:16.3338805Z scale_ub=None, 2025-05-07T20:32:16.3338888Z contiguous=False, 2025-05-07T20:32:16.3338973Z compiled=True, 2025-05-07T20:32:16.3339043Z ) 2025-05-07T20:32:16.3339251Z self = 2025-05-07T20:32:16.3339419Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3339431Z 2025-05-07T20:32:16.3339503Z @given( 2025-05-07T20:32:16.3339619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3339719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3339829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3339942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3340056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3340125Z ) 2025-05-07T20:32:16.3340371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3340462Z def test_silu_mul_quant( 2025-05-07T20:32:16.3340536Z self, 2025-05-07T20:32:16.3340610Z T: int, 2025-05-07T20:32:16.3340685Z D: int, 2025-05-07T20:32:16.3340781Z scale_ub: Optional[float], 2025-05-07T20:32:16.3340871Z contiguous: bool, 2025-05-07T20:32:16.3340951Z compiled: bool, 2025-05-07T20:32:16.3341024Z ) -> None: 2025-05-07T20:32:16.3341121Z torch.manual_seed(2025) 2025-05-07T20:32:16.3341191Z 2025-05-07T20:32:16.3341352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3341423Z 2025-05-07T20:32:16.3341511Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3341637Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3341721Z x = x_sign * x_clamp 2025-05-07T20:32:16.3341796Z x0 = x[:, :D] 2025-05-07T20:32:16.3341874Z x1 = x[:, D:] 2025-05-07T20:32:16.3341945Z 2025-05-07T20:32:16.3342026Z if contiguous: 2025-05-07T20:32:16.3342120Z x0 = x0.contiguous() 2025-05-07T20:32:16.3342205Z x1 = x1.contiguous() 2025-05-07T20:32:16.3342276Z 2025-05-07T20:32:16.3342364Z if scale_ub is not None: 2025-05-07T20:32:16.3342464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3342593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3342669Z ) 2025-05-07T20:32:16.3346656Z else: 2025-05-07T20:32:16.3346756Z scale_ub_tensor = None 2025-05-07T20:32:16.3346827Z 2025-05-07T20:32:16.3346960Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3347045Z op = silu_mul_quant 2025-05-07T20:32:16.3347126Z if compiled: 2025-05-07T20:32:16.3347227Z op = torch.compile(op) 2025-05-07T20:32:16.3347328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3347400Z 2025-05-07T20:32:16.3347487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3347617Z 2025-05-07T20:32:16.3347713Z moe/activation_test.py:117: 2025-05-07T20:32:16.3347845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3347942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3348039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3348454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3348622Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3349106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3349201Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3349549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3349768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3350104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3350194Z kernel = self.compile( 2025-05-07T20:32:16.3350574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3350744Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3350878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3350883Z 2025-05-07T20:32:16.3351082Z self = 2025-05-07T20:32:16.3351839Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3352342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a9c629800>} 2025-05-07T20:32:16.3353071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3353259Z context = 2025-05-07T20:32:16.3353268Z 2025-05-07T20:32:16.3353427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3353685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3353790Z module_map=module_map) 2025-05-07T20:32:16.3353950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3354048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3354122Z E ^ 2025-05-07T20:32:16.3354470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3354474Z 2025-05-07T20:32:16.3354880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3354885Z 2025-05-07T20:32:16.3354983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3355201Z self=, 2025-05-07T20:32:16.3355279Z T=4096, 2025-05-07T20:32:16.3355350Z D=7168, 2025-05-07T20:32:16.3355430Z scale_ub=None, 2025-05-07T20:32:16.3355511Z contiguous=False, 2025-05-07T20:32:16.3355588Z compiled=True, 2025-05-07T20:32:16.3355659Z ) 2025-05-07T20:32:16.3355870Z self = 2025-05-07T20:32:16.3356036Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3356044Z 2025-05-07T20:32:16.3356208Z @given( 2025-05-07T20:32:16.3356327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3356427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3356536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3356646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3356761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3356905Z ) 2025-05-07T20:32:16.3357143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3357240Z def test_silu_mul_quant( 2025-05-07T20:32:16.3357313Z self, 2025-05-07T20:32:16.3357388Z T: int, 2025-05-07T20:32:16.3357465Z D: int, 2025-05-07T20:32:16.3357559Z scale_ub: Optional[float], 2025-05-07T20:32:16.3357647Z contiguous: bool, 2025-05-07T20:32:16.3357733Z compiled: bool, 2025-05-07T20:32:16.3357805Z ) -> None: 2025-05-07T20:32:16.3357902Z torch.manual_seed(2025) 2025-05-07T20:32:16.3357980Z 2025-05-07T20:32:16.3358171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3358265Z 2025-05-07T20:32:16.3358355Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3358477Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3358564Z x = x_sign * x_clamp 2025-05-07T20:32:16.3358643Z x0 = x[:, :D] 2025-05-07T20:32:16.3358727Z x1 = x[:, D:] 2025-05-07T20:32:16.3358801Z 2025-05-07T20:32:16.3358881Z if contiguous: 2025-05-07T20:32:16.3358970Z x0 = x0.contiguous() 2025-05-07T20:32:16.3359058Z x1 = x1.contiguous() 2025-05-07T20:32:16.3359128Z 2025-05-07T20:32:16.3359426Z if scale_ub is not None: 2025-05-07T20:32:16.3359581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3359751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3359836Z ) 2025-05-07T20:32:16.3359919Z else: 2025-05-07T20:32:16.3360010Z scale_ub_tensor = None 2025-05-07T20:32:16.3360084Z 2025-05-07T20:32:16.3360209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3360298Z op = silu_mul_quant 2025-05-07T20:32:16.3360383Z if compiled: 2025-05-07T20:32:16.3360479Z op = torch.compile(op) 2025-05-07T20:32:16.3360579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3360655Z 2025-05-07T20:32:16.3360741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3360746Z 2025-05-07T20:32:16.3360841Z moe/activation_test.py:117: 2025-05-07T20:32:16.3360968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3361064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3361168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3361527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3361620Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3362107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3362200Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3362552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3362774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3363102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3363196Z kernel = self.compile( 2025-05-07T20:32:16.3363568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3363741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3364010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3364016Z 2025-05-07T20:32:16.3364218Z self = 2025-05-07T20:32:16.3364977Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3365581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43953880>} 2025-05-07T20:32:16.3366313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3366499Z context = 2025-05-07T20:32:16.3366508Z 2025-05-07T20:32:16.3366666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3366923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3367025Z module_map=module_map) 2025-05-07T20:32:16.3367183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3367281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3367358Z E ^ 2025-05-07T20:32:16.3367704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3367709Z 2025-05-07T20:32:16.3368115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3368120Z 2025-05-07T20:32:16.3368248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3368493Z self=, 2025-05-07T20:32:16.3368570Z T=16384, 2025-05-07T20:32:16.3368650Z D=5120, 2025-05-07T20:32:16.3368730Z scale_ub=1200.0, 2025-05-07T20:32:16.3368814Z contiguous=False, 2025-05-07T20:32:16.3368898Z compiled=False, 2025-05-07T20:32:16.3368970Z ) 2025-05-07T20:32:16.3369183Z self = 2025-05-07T20:32:16.3369361Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3369371Z 2025-05-07T20:32:16.3369444Z @given( 2025-05-07T20:32:16.3369562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3369657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3369767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3369884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3369993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3370061Z ) 2025-05-07T20:32:16.3370308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3370397Z def test_silu_mul_quant( 2025-05-07T20:32:16.3370471Z self, 2025-05-07T20:32:16.3370546Z T: int, 2025-05-07T20:32:16.3370617Z D: int, 2025-05-07T20:32:16.3370712Z scale_ub: Optional[float], 2025-05-07T20:32:16.3370801Z contiguous: bool, 2025-05-07T20:32:16.3370884Z compiled: bool, 2025-05-07T20:32:16.3370971Z ) -> None: 2025-05-07T20:32:16.3371061Z torch.manual_seed(2025) 2025-05-07T20:32:16.3371129Z 2025-05-07T20:32:16.3371294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3371365Z 2025-05-07T20:32:16.3371453Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3371576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3371660Z x = x_sign * x_clamp 2025-05-07T20:32:16.3371736Z x0 = x[:, :D] 2025-05-07T20:32:16.3371897Z x1 = x[:, D:] 2025-05-07T20:32:16.3371968Z 2025-05-07T20:32:16.3372048Z if contiguous: 2025-05-07T20:32:16.3372136Z x0 = x0.contiguous() 2025-05-07T20:32:16.3372221Z x1 = x1.contiguous() 2025-05-07T20:32:16.3372291Z 2025-05-07T20:32:16.3372387Z if scale_ub is not None: 2025-05-07T20:32:16.3372487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3372619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3372844Z ) 2025-05-07T20:32:16.3372918Z else: 2025-05-07T20:32:16.3373066Z scale_ub_tensor = None 2025-05-07T20:32:16.3373136Z 2025-05-07T20:32:16.3373262Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3373350Z op = silu_mul_quant 2025-05-07T20:32:16.3373430Z if compiled: 2025-05-07T20:32:16.3373526Z op = torch.compile(op) 2025-05-07T20:32:16.3373629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3373704Z 2025-05-07T20:32:16.3373789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3373797Z 2025-05-07T20:32:16.3373889Z moe/activation_test.py:117: 2025-05-07T20:32:16.3374012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3374113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3374206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3374695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.3374797Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3375149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3375371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3375703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3375795Z kernel = self.compile( 2025-05-07T20:32:16.3376171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3376339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3376461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3376470Z 2025-05-07T20:32:16.3376672Z self = 2025-05-07T20:32:16.3377426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3377920Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43952700>} 2025-05-07T20:32:16.3378650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3378839Z context = 2025-05-07T20:32:16.3378844Z 2025-05-07T20:32:16.3379004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3379262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3379369Z module_map=module_map) 2025-05-07T20:32:16.3379524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3379618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3379694Z E ^ 2025-05-07T20:32:16.3380039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3380151Z 2025-05-07T20:32:16.3380561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3380565Z 2025-05-07T20:32:16.3380663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3380877Z self=, 2025-05-07T20:32:16.3380954Z T=16384, 2025-05-07T20:32:16.3381103Z D=5120, 2025-05-07T20:32:16.3381183Z scale_ub=1200.0, 2025-05-07T20:32:16.3381267Z contiguous=True, 2025-05-07T20:32:16.3381344Z compiled=True, 2025-05-07T20:32:16.3381417Z ) 2025-05-07T20:32:16.3381628Z self = 2025-05-07T20:32:16.3381796Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3381800Z 2025-05-07T20:32:16.3381875Z @given( 2025-05-07T20:32:16.3381990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3382092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3382206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3382318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3382426Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3382500Z ) 2025-05-07T20:32:16.3382739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3382842Z def test_silu_mul_quant( 2025-05-07T20:32:16.3382914Z self, 2025-05-07T20:32:16.3382988Z T: int, 2025-05-07T20:32:16.3383062Z D: int, 2025-05-07T20:32:16.3383155Z scale_ub: Optional[float], 2025-05-07T20:32:16.3383241Z contiguous: bool, 2025-05-07T20:32:16.3383326Z compiled: bool, 2025-05-07T20:32:16.3383402Z ) -> None: 2025-05-07T20:32:16.3383493Z torch.manual_seed(2025) 2025-05-07T20:32:16.3383567Z 2025-05-07T20:32:16.3383734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3383805Z 2025-05-07T20:32:16.3383900Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3384021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3384110Z x = x_sign * x_clamp 2025-05-07T20:32:16.3384185Z x0 = x[:, :D] 2025-05-07T20:32:16.3384260Z x1 = x[:, D:] 2025-05-07T20:32:16.3384329Z 2025-05-07T20:32:16.3384409Z if contiguous: 2025-05-07T20:32:16.3384502Z x0 = x0.contiguous() 2025-05-07T20:32:16.3384591Z x1 = x1.contiguous() 2025-05-07T20:32:16.3384661Z 2025-05-07T20:32:16.3384748Z if scale_ub is not None: 2025-05-07T20:32:16.3384852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3384980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3385053Z ) 2025-05-07T20:32:16.3385125Z else: 2025-05-07T20:32:16.3385214Z scale_ub_tensor = None 2025-05-07T20:32:16.3385287Z 2025-05-07T20:32:16.3385416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3385502Z op = silu_mul_quant 2025-05-07T20:32:16.3385585Z if compiled: 2025-05-07T20:32:16.3385680Z op = torch.compile(op) 2025-05-07T20:32:16.3385779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3385851Z 2025-05-07T20:32:16.3385937Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3385949Z 2025-05-07T20:32:16.3386042Z moe/activation_test.py:117: 2025-05-07T20:32:16.3386169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3386264Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3386360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3386718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3386807Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3387371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3387466Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3387815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3388034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3388463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3388563Z kernel = self.compile( 2025-05-07T20:32:16.3388951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3389121Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3389245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3389250Z 2025-05-07T20:32:16.3389456Z self = 2025-05-07T20:32:16.3390212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3390703Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561e40>} 2025-05-07T20:32:16.3391434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3391621Z context = 2025-05-07T20:32:16.3391626Z 2025-05-07T20:32:16.3391786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3392045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3392149Z module_map=module_map) 2025-05-07T20:32:16.3392304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3392403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3392477Z E ^ 2025-05-07T20:32:16.3392821Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3392833Z 2025-05-07T20:32:16.3393237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3393242Z 2025-05-07T20:32:16.3393340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3393558Z self=, 2025-05-07T20:32:16.3393632Z T=16384, 2025-05-07T20:32:16.3393706Z D=5120, 2025-05-07T20:32:16.3393791Z scale_ub=None, 2025-05-07T20:32:16.3393873Z contiguous=False, 2025-05-07T20:32:16.3393951Z compiled=True, 2025-05-07T20:32:16.3394026Z ) 2025-05-07T20:32:16.3394236Z self = 2025-05-07T20:32:16.3394409Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3394413Z 2025-05-07T20:32:16.3394486Z @given( 2025-05-07T20:32:16.3394606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3394701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3394810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3394921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3395035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3395106Z ) 2025-05-07T20:32:16.3395347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3395523Z def test_silu_mul_quant( 2025-05-07T20:32:16.3395596Z self, 2025-05-07T20:32:16.3395671Z T: int, 2025-05-07T20:32:16.3395743Z D: int, 2025-05-07T20:32:16.3395837Z scale_ub: Optional[float], 2025-05-07T20:32:16.3395924Z contiguous: bool, 2025-05-07T20:32:16.3396006Z compiled: bool, 2025-05-07T20:32:16.3396080Z ) -> None: 2025-05-07T20:32:16.3396174Z torch.manual_seed(2025) 2025-05-07T20:32:16.3396321Z 2025-05-07T20:32:16.3396483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3396555Z 2025-05-07T20:32:16.3396643Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3396764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3396849Z x = x_sign * x_clamp 2025-05-07T20:32:16.3396926Z x0 = x[:, :D] 2025-05-07T20:32:16.3397005Z x1 = x[:, D:] 2025-05-07T20:32:16.3397072Z 2025-05-07T20:32:16.3397149Z if contiguous: 2025-05-07T20:32:16.3397242Z x0 = x0.contiguous() 2025-05-07T20:32:16.3397326Z x1 = x1.contiguous() 2025-05-07T20:32:16.3397395Z 2025-05-07T20:32:16.3397484Z if scale_ub is not None: 2025-05-07T20:32:16.3397585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3397714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3397791Z ) 2025-05-07T20:32:16.3397866Z else: 2025-05-07T20:32:16.3397961Z scale_ub_tensor = None 2025-05-07T20:32:16.3398036Z 2025-05-07T20:32:16.3398159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3398250Z op = silu_mul_quant 2025-05-07T20:32:16.3398332Z if compiled: 2025-05-07T20:32:16.3398447Z op = torch.compile(op) 2025-05-07T20:32:16.3398561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3398647Z 2025-05-07T20:32:16.3398735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3398739Z 2025-05-07T20:32:16.3398839Z moe/activation_test.py:117: 2025-05-07T20:32:16.3398962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3399056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3399151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3399509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3399604Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3400085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3400179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3400530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3400745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3401078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3401174Z kernel = self.compile( 2025-05-07T20:32:16.3401544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3401718Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3401839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3401848Z 2025-05-07T20:32:16.3402048Z self = 2025-05-07T20:32:16.3402805Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3403378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59321e40>} 2025-05-07T20:32:16.3404105Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3404290Z context = 2025-05-07T20:32:16.3404392Z 2025-05-07T20:32:16.3404558Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3404813Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3404916Z module_map=module_map) 2025-05-07T20:32:16.3405074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3405170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3405246Z E ^ 2025-05-07T20:32:16.3405599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3405604Z 2025-05-07T20:32:16.3406006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3406011Z 2025-05-07T20:32:16.3406114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3406331Z self=, 2025-05-07T20:32:16.3406421Z T=2048, 2025-05-07T20:32:16.3406492Z D=5120, 2025-05-07T20:32:16.3406571Z scale_ub=None, 2025-05-07T20:32:16.3406657Z contiguous=False, 2025-05-07T20:32:16.3406738Z compiled=True, 2025-05-07T20:32:16.3406814Z ) 2025-05-07T20:32:16.3407026Z self = 2025-05-07T20:32:16.3407191Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3407196Z 2025-05-07T20:32:16.3407271Z @given( 2025-05-07T20:32:16.3407394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3407490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3407605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3407717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3407827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3407897Z ) 2025-05-07T20:32:16.3408162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3408268Z def test_silu_mul_quant( 2025-05-07T20:32:16.3408357Z self, 2025-05-07T20:32:16.3408430Z T: int, 2025-05-07T20:32:16.3408504Z D: int, 2025-05-07T20:32:16.3408600Z scale_ub: Optional[float], 2025-05-07T20:32:16.3408685Z contiguous: bool, 2025-05-07T20:32:16.3408773Z compiled: bool, 2025-05-07T20:32:16.3408848Z ) -> None: 2025-05-07T20:32:16.3408938Z torch.manual_seed(2025) 2025-05-07T20:32:16.3409010Z 2025-05-07T20:32:16.3409177Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3409248Z 2025-05-07T20:32:16.3409335Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3409458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3409545Z x = x_sign * x_clamp 2025-05-07T20:32:16.3409622Z x0 = x[:, :D] 2025-05-07T20:32:16.3409697Z x1 = x[:, D:] 2025-05-07T20:32:16.3409775Z 2025-05-07T20:32:16.3409855Z if contiguous: 2025-05-07T20:32:16.3409942Z x0 = x0.contiguous() 2025-05-07T20:32:16.3410029Z x1 = x1.contiguous() 2025-05-07T20:32:16.3410099Z 2025-05-07T20:32:16.3410185Z if scale_ub is not None: 2025-05-07T20:32:16.3410291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3410421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3410493Z ) 2025-05-07T20:32:16.3410569Z else: 2025-05-07T20:32:16.3410744Z scale_ub_tensor = None 2025-05-07T20:32:16.3410816Z 2025-05-07T20:32:16.3410943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3411028Z op = silu_mul_quant 2025-05-07T20:32:16.3411113Z if compiled: 2025-05-07T20:32:16.3411209Z op = torch.compile(op) 2025-05-07T20:32:16.3411308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3411380Z 2025-05-07T20:32:16.3411549Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3411554Z 2025-05-07T20:32:16.3411645Z moe/activation_test.py:117: 2025-05-07T20:32:16.3411775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3411870Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3411968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3412330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3412418Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3412906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3413048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3413395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3413620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3413955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3414048Z kernel = self.compile( 2025-05-07T20:32:16.3414419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3414587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3414712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3414723Z 2025-05-07T20:32:16.3414922Z self = 2025-05-07T20:32:16.3415680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3416179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59320d60>} 2025-05-07T20:32:16.3416906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3417092Z context = 2025-05-07T20:32:16.3417097Z 2025-05-07T20:32:16.3417260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3417516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3417619Z module_map=module_map) 2025-05-07T20:32:16.3417777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3417875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3417954Z E ^ 2025-05-07T20:32:16.3418298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3418307Z 2025-05-07T20:32:16.3418708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3418712Z 2025-05-07T20:32:16.3418811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3419031Z self=, 2025-05-07T20:32:16.3419188Z T=2048, 2025-05-07T20:32:16.3419263Z D=5120, 2025-05-07T20:32:16.3419344Z scale_ub=1200.0, 2025-05-07T20:32:16.3419430Z contiguous=False, 2025-05-07T20:32:16.3419508Z compiled=True, 2025-05-07T20:32:16.3419583Z ) 2025-05-07T20:32:16.3419794Z self = 2025-05-07T20:32:16.3419969Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3420048Z 2025-05-07T20:32:16.3420124Z @given( 2025-05-07T20:32:16.3420238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3420335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3420446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3420560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3420672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3420743Z ) 2025-05-07T20:32:16.3420985Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3421080Z def test_silu_mul_quant( 2025-05-07T20:32:16.3421153Z self, 2025-05-07T20:32:16.3421231Z T: int, 2025-05-07T20:32:16.3421303Z D: int, 2025-05-07T20:32:16.3421398Z scale_ub: Optional[float], 2025-05-07T20:32:16.3421486Z contiguous: bool, 2025-05-07T20:32:16.3421568Z compiled: bool, 2025-05-07T20:32:16.3421641Z ) -> None: 2025-05-07T20:32:16.3421741Z torch.manual_seed(2025) 2025-05-07T20:32:16.3421810Z 2025-05-07T20:32:16.3421975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3422045Z 2025-05-07T20:32:16.3422132Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3422252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3422340Z x = x_sign * x_clamp 2025-05-07T20:32:16.3422414Z x0 = x[:, :D] 2025-05-07T20:32:16.3422493Z x1 = x[:, D:] 2025-05-07T20:32:16.3422559Z 2025-05-07T20:32:16.3422641Z if contiguous: 2025-05-07T20:32:16.3422730Z x0 = x0.contiguous() 2025-05-07T20:32:16.3422818Z x1 = x1.contiguous() 2025-05-07T20:32:16.3422886Z 2025-05-07T20:32:16.3422975Z if scale_ub is not None: 2025-05-07T20:32:16.3423075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3423204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3423284Z ) 2025-05-07T20:32:16.3423356Z else: 2025-05-07T20:32:16.3423447Z scale_ub_tensor = None 2025-05-07T20:32:16.3423519Z 2025-05-07T20:32:16.3423644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3423729Z op = silu_mul_quant 2025-05-07T20:32:16.3423813Z if compiled: 2025-05-07T20:32:16.3423907Z op = torch.compile(op) 2025-05-07T20:32:16.3424010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3424076Z 2025-05-07T20:32:16.3424167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3424172Z 2025-05-07T20:32:16.3424268Z moe/activation_test.py:117: 2025-05-07T20:32:16.3424393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3424491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3424593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3424949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3425045Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3425528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3425621Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3425971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3426270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3426602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3426694Z kernel = self.compile( 2025-05-07T20:32:16.3427064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3427236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3427434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3427439Z 2025-05-07T20:32:16.3427638Z self = 2025-05-07T20:32:16.3428451Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3428948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59031a80>} 2025-05-07T20:32:16.3429680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3429863Z context = 2025-05-07T20:32:16.3429873Z 2025-05-07T20:32:16.3430039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3430290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3430391Z module_map=module_map) 2025-05-07T20:32:16.3430548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3430642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3430715Z E ^ 2025-05-07T20:32:16.3431064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3431069Z 2025-05-07T20:32:16.3431471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3431475Z 2025-05-07T20:32:16.3431575Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3431797Z self=, 2025-05-07T20:32:16.3431869Z T=4096, 2025-05-07T20:32:16.3431942Z D=5120, 2025-05-07T20:32:16.3432022Z scale_ub=1200.0, 2025-05-07T20:32:16.3432102Z contiguous=True, 2025-05-07T20:32:16.3432183Z compiled=True, 2025-05-07T20:32:16.3432251Z ) 2025-05-07T20:32:16.3432461Z self = 2025-05-07T20:32:16.3432628Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3432633Z 2025-05-07T20:32:16.3432707Z @given( 2025-05-07T20:32:16.3432823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3432918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3433026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3433139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3433248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3433324Z ) 2025-05-07T20:32:16.3433565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3433655Z def test_silu_mul_quant( 2025-05-07T20:32:16.3433728Z self, 2025-05-07T20:32:16.3433803Z T: int, 2025-05-07T20:32:16.3433878Z D: int, 2025-05-07T20:32:16.3433977Z scale_ub: Optional[float], 2025-05-07T20:32:16.3434063Z contiguous: bool, 2025-05-07T20:32:16.3434146Z compiled: bool, 2025-05-07T20:32:16.3434229Z ) -> None: 2025-05-07T20:32:16.3434447Z torch.manual_seed(2025) 2025-05-07T20:32:16.3434519Z 2025-05-07T20:32:16.3434691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3434760Z 2025-05-07T20:32:16.3434845Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3434970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3435055Z x = x_sign * x_clamp 2025-05-07T20:32:16.3435132Z x0 = x[:, :D] 2025-05-07T20:32:16.3435310Z x1 = x[:, D:] 2025-05-07T20:32:16.3435379Z 2025-05-07T20:32:16.3435461Z if contiguous: 2025-05-07T20:32:16.3435557Z x0 = x0.contiguous() 2025-05-07T20:32:16.3435642Z x1 = x1.contiguous() 2025-05-07T20:32:16.3435714Z 2025-05-07T20:32:16.3435799Z if scale_ub is not None: 2025-05-07T20:32:16.3435901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3436032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3436106Z ) 2025-05-07T20:32:16.3436184Z else: 2025-05-07T20:32:16.3436282Z scale_ub_tensor = None 2025-05-07T20:32:16.3436354Z 2025-05-07T20:32:16.3436481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3436571Z op = silu_mul_quant 2025-05-07T20:32:16.3436653Z if compiled: 2025-05-07T20:32:16.3436747Z op = torch.compile(op) 2025-05-07T20:32:16.3436851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3436927Z 2025-05-07T20:32:16.3437015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3437019Z 2025-05-07T20:32:16.3437113Z moe/activation_test.py:117: 2025-05-07T20:32:16.3437239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3437340Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3437436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3437800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3437891Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3438392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3438498Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3438869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3439089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3439423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3439513Z kernel = self.compile( 2025-05-07T20:32:16.3439883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3440057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3440183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3440188Z 2025-05-07T20:32:16.3440391Z self = 2025-05-07T20:32:16.3441145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3441647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59033420>} 2025-05-07T20:32:16.3442375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3442561Z context = 2025-05-07T20:32:16.3442647Z 2025-05-07T20:32:16.3442814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3443069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3443176Z module_map=module_map) 2025-05-07T20:32:16.3443332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3443504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3443580Z E ^ 2025-05-07T20:32:16.3443924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3443928Z 2025-05-07T20:32:16.3444330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3444339Z 2025-05-07T20:32:16.3444435Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3444655Z self=, 2025-05-07T20:32:16.3444733Z T=128, 2025-05-07T20:32:16.3444806Z D=5120, 2025-05-07T20:32:16.3444885Z scale_ub=1200.0, 2025-05-07T20:32:16.3444974Z contiguous=False, 2025-05-07T20:32:16.3445056Z compiled=True, 2025-05-07T20:32:16.3445127Z ) 2025-05-07T20:32:16.3445342Z self = 2025-05-07T20:32:16.3445506Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3445517Z 2025-05-07T20:32:16.3445590Z @given( 2025-05-07T20:32:16.3445709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3445804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3445916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3446029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3446138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3446214Z ) 2025-05-07T20:32:16.3446455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3446544Z def test_silu_mul_quant( 2025-05-07T20:32:16.3446619Z self, 2025-05-07T20:32:16.3446693Z T: int, 2025-05-07T20:32:16.3446766Z D: int, 2025-05-07T20:32:16.3446866Z scale_ub: Optional[float], 2025-05-07T20:32:16.3446952Z contiguous: bool, 2025-05-07T20:32:16.3447040Z compiled: bool, 2025-05-07T20:32:16.3447112Z ) -> None: 2025-05-07T20:32:16.3447202Z torch.manual_seed(2025) 2025-05-07T20:32:16.3447276Z 2025-05-07T20:32:16.3447439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3447509Z 2025-05-07T20:32:16.3447599Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3447721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3447804Z x = x_sign * x_clamp 2025-05-07T20:32:16.3447887Z x0 = x[:, :D] 2025-05-07T20:32:16.3447967Z x1 = x[:, D:] 2025-05-07T20:32:16.3448036Z 2025-05-07T20:32:16.3448117Z if contiguous: 2025-05-07T20:32:16.3448204Z x0 = x0.contiguous() 2025-05-07T20:32:16.3448290Z x1 = x1.contiguous() 2025-05-07T20:32:16.3448360Z 2025-05-07T20:32:16.3448447Z if scale_ub is not None: 2025-05-07T20:32:16.3448550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3448678Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3448757Z ) 2025-05-07T20:32:16.3448834Z else: 2025-05-07T20:32:16.3448928Z scale_ub_tensor = None 2025-05-07T20:32:16.3448999Z 2025-05-07T20:32:16.3449127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3449212Z op = silu_mul_quant 2025-05-07T20:32:16.3449291Z if compiled: 2025-05-07T20:32:16.3449391Z op = torch.compile(op) 2025-05-07T20:32:16.3449490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3449641Z 2025-05-07T20:32:16.3449737Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3449742Z 2025-05-07T20:32:16.3449834Z moe/activation_test.py:117: 2025-05-07T20:32:16.3449964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3450063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3450158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3450594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3450683Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3451163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3451260Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3451608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3451835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3452165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3452254Z kernel = self.compile( 2025-05-07T20:32:16.3452627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3452801Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3452926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3452930Z 2025-05-07T20:32:16.3453181Z self = 2025-05-07T20:32:16.3453939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3454432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59032c00>} 2025-05-07T20:32:16.3455157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3455348Z context = 2025-05-07T20:32:16.3455353Z 2025-05-07T20:32:16.3455511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3455762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3455865Z module_map=module_map) 2025-05-07T20:32:16.3456022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3456122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3456197Z E ^ 2025-05-07T20:32:16.3456542Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then retried the same test body with further parameter combinations; each reached the Triton compile step and failed with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False
Trying example: T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True
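For context, based only on the op name and the call signature visible in the test (op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale)), the fused _fbgemm_silu_mul_quant kernel presumably computes a SiLU-gated multiply followed by row-wise FP8 quantization with an optional scale upper bound. A plain-PyTorch sketch of that presumed computation (an assumption, not FBGEMM's implementation):

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # SiLU-gated multiply, done in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Row-wise absolute max, optionally capped by scale_ub (a 1-element tensor).
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    # Scale each row into the representable FP8 E4M3 range.
    y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale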
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True — same OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), this time trying to allocate 112.00 MiB with 32.44 MiB free.
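Each failed request is exactly the size of one [T, 2*D] bfloat16 temporary: 16384 × 10240 × 2 bytes = 320 MiB, and 4096 × 14336 × 2 bytes = 112 MiB. So the GPU is not so much fragmented as full: tensors from earlier Hypothesis examples are still alive when the next example starts. One possible mitigation (an assumption, not the repo's actual fix) is to drop dead tensors and return cached blocks between examples:

import gc
import torch

def free_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver

# Hypothetical hook: call this from the test's tearDown(), or wrap each
# Hypothesis example, so one example's inputs cannot starve the next.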
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3548097Z 2025-05-07T20:32:16.3548211Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3548216Z 2025-05-07T20:32:16.3548314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3548531Z self=, 2025-05-07T20:32:16.3548607Z T=16384, 2025-05-07T20:32:16.3548682Z D=7168, 2025-05-07T20:32:16.3548761Z scale_ub=None, 2025-05-07T20:32:16.3548844Z contiguous=False, 2025-05-07T20:32:16.3549007Z compiled=False, 2025-05-07T20:32:16.3549078Z ) 2025-05-07T20:32:16.3549288Z self = 2025-05-07T20:32:16.3549462Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3549467Z 2025-05-07T20:32:16.3549541Z @given( 2025-05-07T20:32:16.3549652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3549829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3549939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3550055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3550163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3550234Z ) 2025-05-07T20:32:16.3550476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3550565Z def test_silu_mul_quant( 2025-05-07T20:32:16.3550638Z self, 2025-05-07T20:32:16.3550719Z T: int, 2025-05-07T20:32:16.3550793Z D: int, 2025-05-07T20:32:16.3550887Z scale_ub: Optional[float], 2025-05-07T20:32:16.3550975Z contiguous: bool, 2025-05-07T20:32:16.3551057Z compiled: bool, 2025-05-07T20:32:16.3551132Z ) -> None: 2025-05-07T20:32:16.3551228Z torch.manual_seed(2025) 2025-05-07T20:32:16.3551295Z 2025-05-07T20:32:16.3551459Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3553226Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3553232Z 2025-05-07T20:32:16.3553348Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3553353Z 2025-05-07T20:32:16.3553449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3553661Z self=, 2025-05-07T20:32:16.3553739Z T=2048, 2025-05-07T20:32:16.3553814Z D=7168, 2025-05-07T20:32:16.3553891Z scale_ub=1200.0, 2025-05-07T20:32:16.3553972Z contiguous=True, 2025-05-07T20:32:16.3554050Z compiled=True, 2025-05-07T20:32:16.3554119Z ) 2025-05-07T20:32:16.3554335Z self = 2025-05-07T20:32:16.3554497Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3554502Z 2025-05-07T20:32:16.3554579Z @given( 2025-05-07T20:32:16.3554691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3554793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3554904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3555016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3555127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3555201Z ) 2025-05-07T20:32:16.3555437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3555536Z def test_silu_mul_quant( 2025-05-07T20:32:16.3555611Z self, 2025-05-07T20:32:16.3555686Z T: int, 2025-05-07T20:32:16.3555764Z D: int, 2025-05-07T20:32:16.3555858Z scale_ub: Optional[float], 2025-05-07T20:32:16.3555943Z contiguous: bool, 2025-05-07T20:32:16.3556025Z compiled: bool, 2025-05-07T20:32:16.3556100Z ) -> None: 2025-05-07T20:32:16.3556192Z torch.manual_seed(2025) 2025-05-07T20:32:16.3556264Z 2025-05-07T20:32:16.3556511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3556584Z 2025-05-07T20:32:16.3556679Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3556798Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3558542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3558646Z 2025-05-07T20:32:16.3558761Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3558765Z 2025-05-07T20:32:16.3558866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3559087Z self=, 2025-05-07T20:32:16.3559158Z T=2048, 2025-05-07T20:32:16.3559572Z D=7168, 2025-05-07T20:32:16.3559668Z scale_ub=None, 2025-05-07T20:32:16.3559748Z contiguous=True, 2025-05-07T20:32:16.3559832Z compiled=False, 2025-05-07T20:32:16.3559904Z ) 2025-05-07T20:32:16.3560113Z self = 2025-05-07T20:32:16.3560290Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3560295Z 2025-05-07T20:32:16.3560370Z @given( 2025-05-07T20:32:16.3560489Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3560586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3560694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3560811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3560919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3560997Z ) 2025-05-07T20:32:16.3561243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3561330Z def test_silu_mul_quant( 2025-05-07T20:32:16.3561403Z self, 2025-05-07T20:32:16.3561481Z T: int, 2025-05-07T20:32:16.3561555Z D: int, 2025-05-07T20:32:16.3561647Z scale_ub: Optional[float], 2025-05-07T20:32:16.3561735Z contiguous: bool, 2025-05-07T20:32:16.3561819Z compiled: bool, 2025-05-07T20:32:16.3561897Z ) -> None: 2025-05-07T20:32:16.3561986Z torch.manual_seed(2025) 2025-05-07T20:32:16.3562058Z 2025-05-07T20:32:16.3562225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3562294Z 2025-05-07T20:32:16.3562383Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3564121Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3564131Z 2025-05-07T20:32:16.3564245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3564249Z 2025-05-07T20:32:16.3564348Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3564565Z self=, 2025-05-07T20:32:16.3564638Z T=1, 2025-05-07T20:32:16.3564719Z D=7168, 2025-05-07T20:32:16.3564800Z scale_ub=1200.0, 2025-05-07T20:32:16.3564886Z contiguous=True, 2025-05-07T20:32:16.3564968Z compiled=False, 2025-05-07T20:32:16.3565037Z ) 2025-05-07T20:32:16.3565393Z self = 2025-05-07T20:32:16.3565553Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3565558Z 2025-05-07T20:32:16.3565632Z @given( 2025-05-07T20:32:16.3565749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3565844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3565954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3566181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3566293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3566368Z ) 2025-05-07T20:32:16.3566608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3566699Z def test_silu_mul_quant( 2025-05-07T20:32:16.3566778Z self, 2025-05-07T20:32:16.3566852Z T: int, 2025-05-07T20:32:16.3566925Z D: int, 2025-05-07T20:32:16.3567020Z scale_ub: Optional[float], 2025-05-07T20:32:16.3567113Z contiguous: bool, 2025-05-07T20:32:16.3567196Z compiled: bool, 2025-05-07T20:32:16.3567275Z ) -> None: 2025-05-07T20:32:16.3567365Z torch.manual_seed(2025) 2025-05-07T20:32:16.3567437Z 2025-05-07T20:32:16.3567601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3567670Z 2025-05-07T20:32:16.3567763Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3567889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3567975Z x = x_sign * x_clamp 2025-05-07T20:32:16.3568057Z x0 = x[:, :D] 2025-05-07T20:32:16.3568133Z x1 = x[:, D:] 2025-05-07T20:32:16.3568202Z 2025-05-07T20:32:16.3568290Z if contiguous: 2025-05-07T20:32:16.3568391Z x0 = x0.contiguous() 2025-05-07T20:32:16.3568489Z x1 = x1.contiguous() 2025-05-07T20:32:16.3568574Z 2025-05-07T20:32:16.3568677Z if scale_ub is not None: 2025-05-07T20:32:16.3568786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3568918Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3568991Z ) 2025-05-07T20:32:16.3569068Z else: 2025-05-07T20:32:16.3569158Z scale_ub_tensor = None 2025-05-07T20:32:16.3569226Z 2025-05-07T20:32:16.3569356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3569442Z op = silu_mul_quant 2025-05-07T20:32:16.3569531Z if compiled: 2025-05-07T20:32:16.3569631Z op = torch.compile(op) 2025-05-07T20:32:16.3569732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3569804Z 2025-05-07T20:32:16.3569897Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3569901Z 2025-05-07T20:32:16.3569997Z moe/activation_test.py:117: 2025-05-07T20:32:16.3570121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3570220Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3570321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3570814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3570908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3571258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3571487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3571819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3571913Z kernel = self.compile( 2025-05-07T20:32:16.3572286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3572457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3572666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3572671Z 2025-05-07T20:32:16.3572872Z self = 2025-05-07T20:32:16.3573680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3574251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c1440>} 2025-05-07T20:32:16.3574982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3575172Z context = 2025-05-07T20:32:16.3575182Z 2025-05-07T20:32:16.3575343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3575603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3575708Z module_map=module_map) 2025-05-07T20:32:16.3575865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3575969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3576043Z E ^ 2025-05-07T20:32:16.3576386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3576391Z 2025-05-07T20:32:16.3576796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3576800Z 2025-05-07T20:32:16.3576899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3577121Z self=, 2025-05-07T20:32:16.3577194Z T=128, 2025-05-07T20:32:16.3577268Z D=5120, 2025-05-07T20:32:16.3577348Z scale_ub=None, 2025-05-07T20:32:16.3577431Z contiguous=True, 2025-05-07T20:32:16.3577513Z compiled=False, 2025-05-07T20:32:16.3577584Z ) 2025-05-07T20:32:16.3577795Z self = 2025-05-07T20:32:16.3577961Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3577969Z 2025-05-07T20:32:16.3578042Z @given( 2025-05-07T20:32:16.3578155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3578255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3578364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3578478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3578593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3578666Z ) 2025-05-07T20:32:16.3578909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3579003Z def test_silu_mul_quant( 2025-05-07T20:32:16.3579080Z self, 2025-05-07T20:32:16.3579159Z T: int, 2025-05-07T20:32:16.3579231Z D: int, 2025-05-07T20:32:16.3579326Z scale_ub: Optional[float], 2025-05-07T20:32:16.3579416Z contiguous: bool, 2025-05-07T20:32:16.3579499Z compiled: bool, 2025-05-07T20:32:16.3579581Z ) -> None: 2025-05-07T20:32:16.3579677Z torch.manual_seed(2025) 2025-05-07T20:32:16.3579749Z 2025-05-07T20:32:16.3579910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3579984Z 2025-05-07T20:32:16.3580073Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3580196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3580286Z x = x_sign * x_clamp 2025-05-07T20:32:16.3580360Z x0 = x[:, :D] 2025-05-07T20:32:16.3580521Z x1 = x[:, D:] 2025-05-07T20:32:16.3580597Z 2025-05-07T20:32:16.3580677Z if contiguous: 2025-05-07T20:32:16.3580768Z x0 = x0.contiguous() 2025-05-07T20:32:16.3580853Z x1 = x1.contiguous() 2025-05-07T20:32:16.3580924Z 2025-05-07T20:32:16.3581013Z if scale_ub is not None: 2025-05-07T20:32:16.3581116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3581248Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3581401Z ) 2025-05-07T20:32:16.3581476Z else: 2025-05-07T20:32:16.3581568Z scale_ub_tensor = None 2025-05-07T20:32:16.3581638Z 2025-05-07T20:32:16.3581763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3581848Z op = silu_mul_quant 2025-05-07T20:32:16.3581933Z if compiled: 2025-05-07T20:32:16.3582028Z op = torch.compile(op) 2025-05-07T20:32:16.3582135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3582209Z 2025-05-07T20:32:16.3582295Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3582300Z 2025-05-07T20:32:16.3582397Z moe/activation_test.py:117: 2025-05-07T20:32:16.3582522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3582618Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3582719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3583207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3583308Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3583656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3583874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3584207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3588190Z kernel = self.compile( 2025-05-07T20:32:16.3588617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3588789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3588917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3588928Z 2025-05-07T20:32:16.3589128Z self = 2025-05-07T20:32:16.3589888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3590380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c2520>} 2025-05-07T20:32:16.3591111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3591299Z context = 2025-05-07T20:32:16.3591304Z 2025-05-07T20:32:16.3591466Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3591727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3591832Z module_map=module_map) 2025-05-07T20:32:16.3591990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3592089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3592162Z E ^ 2025-05-07T20:32:16.3592512Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3592644Z 2025-05-07T20:32:16.3593051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3593055Z 2025-05-07T20:32:16.3593152Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3593373Z self=, 2025-05-07T20:32:16.3593446Z T=128, 2025-05-07T20:32:16.3593603Z D=7168, 2025-05-07T20:32:16.3593684Z scale_ub=None, 2025-05-07T20:32:16.3593763Z contiguous=True, 2025-05-07T20:32:16.3593850Z compiled=False, 2025-05-07T20:32:16.3593921Z ) 2025-05-07T20:32:16.3594133Z self = 2025-05-07T20:32:16.3594300Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3594304Z 2025-05-07T20:32:16.3594377Z @given( 2025-05-07T20:32:16.3594492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3594597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3594708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3594818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3594934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3595005Z ) 2025-05-07T20:32:16.3595248Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3595343Z def test_silu_mul_quant( 2025-05-07T20:32:16.3595415Z self, 2025-05-07T20:32:16.3595491Z T: int, 2025-05-07T20:32:16.3595562Z D: int, 2025-05-07T20:32:16.3595658Z scale_ub: Optional[float], 2025-05-07T20:32:16.3595748Z contiguous: bool, 2025-05-07T20:32:16.3595829Z compiled: bool, 2025-05-07T20:32:16.3595901Z ) -> None: 2025-05-07T20:32:16.3595997Z torch.manual_seed(2025) 2025-05-07T20:32:16.3596065Z 2025-05-07T20:32:16.3596232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3596305Z 2025-05-07T20:32:16.3596394Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3596519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3596606Z x = x_sign * x_clamp 2025-05-07T20:32:16.3596681Z x0 = x[:, :D] 2025-05-07T20:32:16.3596763Z x1 = x[:, D:] 2025-05-07T20:32:16.3596833Z 2025-05-07T20:32:16.3596912Z if contiguous: 2025-05-07T20:32:16.3597013Z x0 = x0.contiguous() 2025-05-07T20:32:16.3597099Z x1 = x1.contiguous() 2025-05-07T20:32:16.3597167Z 2025-05-07T20:32:16.3597258Z if scale_ub is not None: 2025-05-07T20:32:16.3597359Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3597491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3597564Z ) 2025-05-07T20:32:16.3597637Z else: 2025-05-07T20:32:16.3597734Z scale_ub_tensor = None 2025-05-07T20:32:16.3597805Z 2025-05-07T20:32:16.3597933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3598023Z op = silu_mul_quant 2025-05-07T20:32:16.3598103Z if compiled: 2025-05-07T20:32:16.3598205Z op = torch.compile(op) 2025-05-07T20:32:16.3598332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3598415Z 2025-05-07T20:32:16.3598514Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3598523Z 2025-05-07T20:32:16.3598618Z moe/activation_test.py:117: 2025-05-07T20:32:16.3598743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3598842Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3598942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3599428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3599527Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3599960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3600179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3600516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3600605Z kernel = self.compile( 2025-05-07T20:32:16.3600982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3601226Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3601348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3601352Z 2025-05-07T20:32:16.3601554Z self = 2025-05-07T20:32:16.3602318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3602813Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c3420>} 2025-05-07T20:32:16.3603541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3603731Z context = 2025-05-07T20:32:16.3603736Z 2025-05-07T20:32:16.3603900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3604151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3604255Z module_map=module_map) 2025-05-07T20:32:16.3604415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3604510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3604585Z E ^ 2025-05-07T20:32:16.3604935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3604940Z 2025-05-07T20:32:16.3605347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3605356Z 2025-05-07T20:32:16.3605453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3605670Z self=, 2025-05-07T20:32:16.3605748Z T=2048, 2025-05-07T20:32:16.3605821Z D=7168, 2025-05-07T20:32:16.3605899Z scale_ub=1200.0, 2025-05-07T20:32:16.3605982Z contiguous=True, 2025-05-07T20:32:16.3606062Z compiled=False, 2025-05-07T20:32:16.3606131Z ) 2025-05-07T20:32:16.3606350Z self = 2025-05-07T20:32:16.3606519Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3606523Z 2025-05-07T20:32:16.3606599Z @given( 2025-05-07T20:32:16.3606714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3606808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3606923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3607044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3607154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3607228Z ) 2025-05-07T20:32:16.3607469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3607559Z def test_silu_mul_quant( 2025-05-07T20:32:16.3607632Z self, 2025-05-07T20:32:16.3607706Z T: int, 2025-05-07T20:32:16.3607786Z D: int, 2025-05-07T20:32:16.3607962Z scale_ub: Optional[float], 2025-05-07T20:32:16.3608049Z contiguous: bool, 2025-05-07T20:32:16.3608133Z compiled: bool, 2025-05-07T20:32:16.3608206Z ) -> None: 2025-05-07T20:32:16.3608311Z torch.manual_seed(2025) 2025-05-07T20:32:16.3608394Z 2025-05-07T20:32:16.3608582Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3610338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3610446Z 2025-05-07T20:32:16.3610564Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3610568Z 2025-05-07T20:32:16.3610664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3610885Z self=, 2025-05-07T20:32:16.3610957Z T=1, 2025-05-07T20:32:16.3611034Z D=5120, 2025-05-07T20:32:16.3611113Z scale_ub=1200.0, 2025-05-07T20:32:16.3611192Z contiguous=True, 2025-05-07T20:32:16.3611282Z compiled=False, 2025-05-07T20:32:16.3611348Z ) 2025-05-07T20:32:16.3611559Z self = 2025-05-07T20:32:16.3611720Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3611725Z 2025-05-07T20:32:16.3611798Z @given( 2025-05-07T20:32:16.3611910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3612006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3612115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3612231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3612340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3612413Z ) 2025-05-07T20:32:16.3612654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3612744Z def test_silu_mul_quant( 2025-05-07T20:32:16.3612816Z self, 2025-05-07T20:32:16.3612891Z T: int, 2025-05-07T20:32:16.3612971Z D: int, 2025-05-07T20:32:16.3613130Z scale_ub: Optional[float], 2025-05-07T20:32:16.3613219Z contiguous: bool, 2025-05-07T20:32:16.3613301Z compiled: bool, 2025-05-07T20:32:16.3613375Z ) -> None: 2025-05-07T20:32:16.3613468Z torch.manual_seed(2025) 2025-05-07T20:32:16.3613538Z 2025-05-07T20:32:16.3613700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3613769Z 2025-05-07T20:32:16.3613856Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3613988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3614073Z x = x_sign * x_clamp 2025-05-07T20:32:16.3614150Z x0 = x[:, :D] 2025-05-07T20:32:16.3614228Z x1 = x[:, D:] 2025-05-07T20:32:16.3614299Z 2025-05-07T20:32:16.3614378Z if contiguous: 2025-05-07T20:32:16.3614469Z x0 = x0.contiguous() 2025-05-07T20:32:16.3614556Z x1 = x1.contiguous() 2025-05-07T20:32:16.3614627Z 2025-05-07T20:32:16.3614719Z if scale_ub is not None: 2025-05-07T20:32:16.3614820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3614949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3615025Z ) 2025-05-07T20:32:16.3615097Z else: 2025-05-07T20:32:16.3615189Z scale_ub_tensor = None 2025-05-07T20:32:16.3615256Z 2025-05-07T20:32:16.3615381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3615473Z op = silu_mul_quant 2025-05-07T20:32:16.3615638Z if compiled: 2025-05-07T20:32:16.3615735Z op = torch.compile(op) 2025-05-07T20:32:16.3615842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3615912Z 2025-05-07T20:32:16.3615998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3616003Z 2025-05-07T20:32:16.3616099Z moe/activation_test.py:117: 2025-05-07T20:32:16.3616225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3616399Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3616495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3616982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3617080Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3617430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3617653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3617991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3618082Z kernel = self.compile( 2025-05-07T20:32:16.3618508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3618681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3618804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3618808Z 2025-05-07T20:32:16.3619009Z self = 2025-05-07T20:32:16.3619773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3620265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a420149a0>} 2025-05-07T20:32:16.3620993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3621182Z context = 2025-05-07T20:32:16.3621190Z 2025-05-07T20:32:16.3621353Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3621605Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3621714Z module_map=module_map) 2025-05-07T20:32:16.3621871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3621970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3622046Z E ^ 2025-05-07T20:32:16.3622388Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3622392Z 2025-05-07T20:32:16.3622796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3622801Z 2025-05-07T20:32:16.3622904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3623120Z self=, 2025-05-07T20:32:16.3623194Z T=2048, 2025-05-07T20:32:16.3623266Z D=5120, 2025-05-07T20:32:16.3623343Z scale_ub=None, 2025-05-07T20:32:16.3623435Z contiguous=True, 2025-05-07T20:32:16.3623514Z compiled=False, 2025-05-07T20:32:16.3623584Z ) 2025-05-07T20:32:16.3623797Z self = 2025-05-07T20:32:16.3624070Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3624075Z 2025-05-07T20:32:16.3624153Z @given( 2025-05-07T20:32:16.3624268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3624363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3624476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3624587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3624768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3624843Z ) 2025-05-07T20:32:16.3625081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3625169Z def test_silu_mul_quant( 2025-05-07T20:32:16.3625247Z self, 2025-05-07T20:32:16.3625320Z T: int, 2025-05-07T20:32:16.3625397Z D: int, 2025-05-07T20:32:16.3625490Z scale_ub: Optional[float], 2025-05-07T20:32:16.3625577Z contiguous: bool, 2025-05-07T20:32:16.3625660Z compiled: bool, 2025-05-07T20:32:16.3625738Z ) -> None: 2025-05-07T20:32:16.3625829Z torch.manual_seed(2025) 2025-05-07T20:32:16.3625904Z 2025-05-07T20:32:16.3626067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3626138Z 2025-05-07T20:32:16.3626228Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3627974Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3627987Z 2025-05-07T20:32:16.3628108Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3628112Z 2025-05-07T20:32:16.3628210Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3628427Z self=, 2025-05-07T20:32:16.3628499Z T=16384, 2025-05-07T20:32:16.3628571Z D=5120, 2025-05-07T20:32:16.3628648Z scale_ub=None, 2025-05-07T20:32:16.3628729Z contiguous=True, 2025-05-07T20:32:16.3628807Z compiled=False, 2025-05-07T20:32:16.3628882Z ) 2025-05-07T20:32:16.3629092Z self = 2025-05-07T20:32:16.3629262Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3629266Z 2025-05-07T20:32:16.3629343Z @given( 2025-05-07T20:32:16.3629455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3629554Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3629664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3629779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3629891Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3629960Z ) 2025-05-07T20:32:16.3630199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3630289Z def test_silu_mul_quant( 2025-05-07T20:32:16.3630360Z self, 2025-05-07T20:32:16.3630430Z T: int, 2025-05-07T20:32:16.3630513Z D: int, 2025-05-07T20:32:16.3630607Z scale_ub: Optional[float], 2025-05-07T20:32:16.3630693Z contiguous: bool, 2025-05-07T20:32:16.3630779Z compiled: bool, 2025-05-07T20:32:16.3630849Z ) -> None: 2025-05-07T20:32:16.3630944Z torch.manual_seed(2025) 2025-05-07T20:32:16.3631011Z 2025-05-07T20:32:16.3631171Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3633081Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next seven Hypothesis examples fail identically, each hitting torch.OutOfMemoryError at the test's first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) (moe/activation_test.py:92); only the parameters and the requested size vary:

Trying example: T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False -> tried to allocate 80.00 MiB
Trying example: T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True -> tried to allocate 112.00 MiB
Trying example: T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True -> tried to allocate 448.00 MiB
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False -> tried to allocate 112.00 MiB

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)

> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3674154Z 2025-05-07T20:32:16.3674271Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3674275Z 2025-05-07T20:32:16.3674370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3674591Z self=, 2025-05-07T20:32:16.3674670Z T=16384, 2025-05-07T20:32:16.3674739Z D=7168, 2025-05-07T20:32:16.3674819Z scale_ub=1200.0, 2025-05-07T20:32:16.3674904Z contiguous=True, 2025-05-07T20:32:16.3674984Z compiled=False, 2025-05-07T20:32:16.3675059Z ) 2025-05-07T20:32:16.3675270Z self = 2025-05-07T20:32:16.3675444Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3675448Z 2025-05-07T20:32:16.3675525Z @given( 2025-05-07T20:32:16.3675637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3675734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3675844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3675957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3676068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3676140Z ) 2025-05-07T20:32:16.3676384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3676474Z def test_silu_mul_quant( 2025-05-07T20:32:16.3676548Z self, 2025-05-07T20:32:16.3676620Z T: int, 2025-05-07T20:32:16.3676696Z D: int, 2025-05-07T20:32:16.3676792Z scale_ub: Optional[float], 2025-05-07T20:32:16.3676876Z contiguous: bool, 2025-05-07T20:32:16.3676965Z compiled: bool, 2025-05-07T20:32:16.3677041Z ) -> None: 2025-05-07T20:32:16.3677130Z torch.manual_seed(2025) 2025-05-07T20:32:16.3677201Z 2025-05-07T20:32:16.3677362Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3679108Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
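A pattern worth noting before the next failures: each request here is small (40-448 MiB) while PyTorch already holds roughly 21.7 GiB, so the OOMs likely stem from memory accumulated across earlier Hypothesis examples rather than from any single tensor. One mitigation sketch, using a hypothetical cleanup helper that the test class could call between examples (e.g. from setUp/tearDown):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreachable Python objects first so their CUDA storages are
        # freed, then return cached blocks to the driver so the next
        # Hypothesis example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()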
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3679114Z 2025-05-07T20:32:16.3679226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3679231Z 2025-05-07T20:32:16.3679336Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3679552Z self=, 2025-05-07T20:32:16.3679628Z T=128, 2025-05-07T20:32:16.3679708Z D=5120, 2025-05-07T20:32:16.3679786Z scale_ub=1200.0, 2025-05-07T20:32:16.3679868Z contiguous=False, 2025-05-07T20:32:16.3679950Z compiled=False, 2025-05-07T20:32:16.3680021Z ) 2025-05-07T20:32:16.3680231Z self = 2025-05-07T20:32:16.3680488Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3680493Z 2025-05-07T20:32:16.3680568Z @given( 2025-05-07T20:32:16.3680687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3680783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3680894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3681009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3681222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3681295Z ) 2025-05-07T20:32:16.3681537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3681629Z def test_silu_mul_quant( 2025-05-07T20:32:16.3681704Z self, 2025-05-07T20:32:16.3681782Z T: int, 2025-05-07T20:32:16.3681854Z D: int, 2025-05-07T20:32:16.3681948Z scale_ub: Optional[float], 2025-05-07T20:32:16.3682039Z contiguous: bool, 2025-05-07T20:32:16.3682126Z compiled: bool, 2025-05-07T20:32:16.3682205Z ) -> None: 2025-05-07T20:32:16.3682295Z torch.manual_seed(2025) 2025-05-07T20:32:16.3682363Z 2025-05-07T20:32:16.3682529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3682600Z 2025-05-07T20:32:16.3682688Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3682815Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3682907Z x = x_sign * x_clamp 2025-05-07T20:32:16.3682986Z x0 = x[:, :D] 2025-05-07T20:32:16.3683068Z x1 = x[:, D:] 2025-05-07T20:32:16.3683140Z 2025-05-07T20:32:16.3683221Z if contiguous: 2025-05-07T20:32:16.3683313Z x0 = x0.contiguous() 2025-05-07T20:32:16.3683400Z x1 = x1.contiguous() 2025-05-07T20:32:16.3683473Z 2025-05-07T20:32:16.3683560Z if scale_ub is not None: 2025-05-07T20:32:16.3683663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3683799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3683869Z ) 2025-05-07T20:32:16.3683943Z else: 2025-05-07T20:32:16.3684039Z scale_ub_tensor = None 2025-05-07T20:32:16.3684108Z 2025-05-07T20:32:16.3684233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3684323Z op = silu_mul_quant 2025-05-07T20:32:16.3684404Z if compiled: 2025-05-07T20:32:16.3684506Z op = torch.compile(op) 2025-05-07T20:32:16.3684610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3684683Z 2025-05-07T20:32:16.3684774Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3684778Z 2025-05-07T20:32:16.3684872Z moe/activation_test.py:117: 2025-05-07T20:32:16.3684996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3685094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3685191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3685686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3685785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3686137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3686355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3686693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3686785Z kernel = self.compile( 2025-05-07T20:32:16.3687162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3687335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3687459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3687468Z 2025-05-07T20:32:16.3687749Z self = 2025-05-07T20:32:16.3688570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3689066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a41fc7560>} 2025-05-07T20:32:16.3689876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3690064Z context = 2025-05-07T20:32:16.3690069Z 2025-05-07T20:32:16.3690232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3690486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3690591Z module_map=module_map) 2025-05-07T20:32:16.3690751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3690845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3690923Z E ^ 2025-05-07T20:32:16.3691272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3691276Z 2025-05-07T20:32:16.3691683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3691688Z 2025-05-07T20:32:16.3691788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3692004Z self=, 2025-05-07T20:32:16.3692083Z T=2048, 2025-05-07T20:32:16.3692161Z D=7168, 2025-05-07T20:32:16.3692244Z scale_ub=None, 2025-05-07T20:32:16.3692327Z contiguous=False, 2025-05-07T20:32:16.3692408Z compiled=False, 2025-05-07T20:32:16.3692479Z ) 2025-05-07T20:32:16.3692692Z self = 2025-05-07T20:32:16.3692859Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3692864Z 2025-05-07T20:32:16.3692949Z @given( 2025-05-07T20:32:16.3693116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3693213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3693326Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3693439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3693550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3693617Z ) 2025-05-07T20:32:16.3693857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3693954Z def test_silu_mul_quant( 2025-05-07T20:32:16.3694029Z self, 2025-05-07T20:32:16.3694102Z T: int, 2025-05-07T20:32:16.3694176Z D: int, 2025-05-07T20:32:16.3694272Z scale_ub: Optional[float], 2025-05-07T20:32:16.3694358Z contiguous: bool, 2025-05-07T20:32:16.3694448Z compiled: bool, 2025-05-07T20:32:16.3694521Z ) -> None: 2025-05-07T20:32:16.3694611Z torch.manual_seed(2025) 2025-05-07T20:32:16.3694687Z 2025-05-07T20:32:16.3694852Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3696688Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
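The CompilationError recurring throughout this run is architectural rather than flaky: Triton's fp8e4nv (FP8 E4M3) only compiles for SM 8.9+ GPUs (Ada/Hopper), and the fallback list ('fp8e4b15', 'fp8e5') in the error is what Triton offers on older parts such as the SM 8.6 A10G, consistent with the ~22 GiB capacity reported above. A hedged dtype-selection sketch, assuming the surrounding kernels can also consume E5M2:

    import torch

    def preferred_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn maps to Triton's fp8e4nv, which needs SM 8.9+;
        # earlier GPUs fall back to E5M2 (Triton's 'fp8e5').
        # Assumption: callers accept either dtype.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2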
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3696695Z 2025-05-07T20:32:16.3696808Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3696812Z 2025-05-07T20:32:16.3696914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3697131Z self=, 2025-05-07T20:32:16.3697281Z T=128, 2025-05-07T20:32:16.3697362Z D=7168, 2025-05-07T20:32:16.3697444Z scale_ub=1200.0, 2025-05-07T20:32:16.3697524Z contiguous=True, 2025-05-07T20:32:16.3697606Z compiled=True, 2025-05-07T20:32:16.3697677Z ) 2025-05-07T20:32:16.3697888Z self = 2025-05-07T20:32:16.3698053Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3698058Z 2025-05-07T20:32:16.3698131Z @given( 2025-05-07T20:32:16.3698256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3698353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3698463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3698578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3698686Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3698757Z ) 2025-05-07T20:32:16.3699004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3699095Z def test_silu_mul_quant( 2025-05-07T20:32:16.3699170Z self, 2025-05-07T20:32:16.3699247Z T: int, 2025-05-07T20:32:16.3699321Z D: int, 2025-05-07T20:32:16.3699419Z scale_ub: Optional[float], 2025-05-07T20:32:16.3699505Z contiguous: bool, 2025-05-07T20:32:16.3699586Z compiled: bool, 2025-05-07T20:32:16.3699662Z ) -> None: 2025-05-07T20:32:16.3699750Z torch.manual_seed(2025) 2025-05-07T20:32:16.3699823Z 2025-05-07T20:32:16.3699990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3700061Z 2025-05-07T20:32:16.3700149Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3700275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3700359Z x = x_sign * x_clamp 2025-05-07T20:32:16.3700437Z x0 = x[:, :D] 2025-05-07T20:32:16.3700517Z x1 = x[:, D:] 2025-05-07T20:32:16.3700592Z 2025-05-07T20:32:16.3700672Z if contiguous: 2025-05-07T20:32:16.3700763Z x0 = x0.contiguous() 2025-05-07T20:32:16.3700848Z x1 = x1.contiguous() 2025-05-07T20:32:16.3700922Z 2025-05-07T20:32:16.3701009Z if scale_ub is not None: 2025-05-07T20:32:16.3701110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3701242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3701314Z ) 2025-05-07T20:32:16.3701388Z else: 2025-05-07T20:32:16.3701488Z scale_ub_tensor = None 2025-05-07T20:32:16.3701560Z 2025-05-07T20:32:16.3701685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3701773Z op = silu_mul_quant 2025-05-07T20:32:16.3701853Z if compiled: 2025-05-07T20:32:16.3701948Z op = torch.compile(op) 2025-05-07T20:32:16.3702055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3702131Z 2025-05-07T20:32:16.3702220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3702225Z 2025-05-07T20:32:16.3702318Z moe/activation_test.py:117: 2025-05-07T20:32:16.3702443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3702544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3702641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3703003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3703177Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3703662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3703761Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3704109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3704327Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3704734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3704826Z kernel = self.compile( 2025-05-07T20:32:16.3705200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3705372Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3705500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3705504Z 2025-05-07T20:32:16.3705708Z self = 2025-05-07T20:32:16.3706470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3706975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a41e30e00>} 2025-05-07T20:32:16.3707705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3707895Z context = 2025-05-07T20:32:16.3707899Z 2025-05-07T20:32:16.3708066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3708338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3708458Z module_map=module_map) 2025-05-07T20:32:16.3708637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3708731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3708814Z E ^ 2025-05-07T20:32:16.3709163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3709167Z 2025-05-07T20:32:16.3709568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3709576Z 2025-05-07T20:32:16.3709676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3714091Z self=, 2025-05-07T20:32:16.3714179Z T=128, 2025-05-07T20:32:16.3714261Z D=7168, 2025-05-07T20:32:16.3714342Z scale_ub=1200.0, 2025-05-07T20:32:16.3714422Z contiguous=True, 2025-05-07T20:32:16.3714508Z compiled=False, 2025-05-07T20:32:16.3714579Z ) 2025-05-07T20:32:16.3714797Z self = 2025-05-07T20:32:16.3714970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3714983Z 2025-05-07T20:32:16.3715059Z @given( 2025-05-07T20:32:16.3715177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3715280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3715394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3715512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3715623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3715698Z ) 2025-05-07T20:32:16.3716051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3716149Z def test_silu_mul_quant( 2025-05-07T20:32:16.3716225Z self, 2025-05-07T20:32:16.3716304Z T: int, 2025-05-07T20:32:16.3716380Z D: int, 2025-05-07T20:32:16.3716479Z scale_ub: Optional[float], 2025-05-07T20:32:16.3716574Z contiguous: bool, 2025-05-07T20:32:16.3716658Z compiled: bool, 2025-05-07T20:32:16.3716814Z ) -> None: 2025-05-07T20:32:16.3716914Z torch.manual_seed(2025) 2025-05-07T20:32:16.3716984Z 2025-05-07T20:32:16.3717156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3717228Z 2025-05-07T20:32:16.3717318Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3717449Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3719270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3719281Z 2025-05-07T20:32:16.3719401Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3719406Z 2025-05-07T20:32:16.3719506Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3719727Z self=, 2025-05-07T20:32:16.3719804Z T=128, 2025-05-07T20:32:16.3719880Z D=5120, 2025-05-07T20:32:16.3719959Z scale_ub=1200.0, 2025-05-07T20:32:16.3720047Z contiguous=True, 2025-05-07T20:32:16.3720126Z compiled=True, 2025-05-07T20:32:16.3720197Z ) 2025-05-07T20:32:16.3720415Z self = 2025-05-07T20:32:16.3720580Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3720584Z 2025-05-07T20:32:16.3720661Z @given( 2025-05-07T20:32:16.3720774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3720869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3720981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3721098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3721208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3721286Z ) 2025-05-07T20:32:16.3721526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3721621Z def test_silu_mul_quant( 2025-05-07T20:32:16.3721693Z self, 2025-05-07T20:32:16.3721769Z T: int, 2025-05-07T20:32:16.3721847Z D: int, 2025-05-07T20:32:16.3721947Z scale_ub: Optional[float], 2025-05-07T20:32:16.3722035Z contiguous: bool, 2025-05-07T20:32:16.3722121Z compiled: bool, 2025-05-07T20:32:16.3722197Z ) -> None: 2025-05-07T20:32:16.3722290Z torch.manual_seed(2025) 2025-05-07T20:32:16.3722360Z 2025-05-07T20:32:16.3722524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3722595Z 2025-05-07T20:32:16.3722687Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3724509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3724516Z 2025-05-07T20:32:16.3724635Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3724640Z 2025-05-07T20:32:16.3724739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3724958Z self=, 2025-05-07T20:32:16.3725030Z T=128, 2025-05-07T20:32:16.3725103Z D=7168, 2025-05-07T20:32:16.3725262Z scale_ub=None, 2025-05-07T20:32:16.3725345Z contiguous=True, 2025-05-07T20:32:16.3725425Z compiled=True, 2025-05-07T20:32:16.3725497Z ) 2025-05-07T20:32:16.3725708Z self = 2025-05-07T20:32:16.3725868Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.3725878Z 2025-05-07T20:32:16.3725951Z @given( 2025-05-07T20:32:16.3726065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3726171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3726282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3726398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3726511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3726585Z ) 2025-05-07T20:32:16.3726822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3726914Z def test_silu_mul_quant( 2025-05-07T20:32:16.3726994Z self, 2025-05-07T20:32:16.3727069Z T: int, 2025-05-07T20:32:16.3727146Z D: int, 2025-05-07T20:32:16.3727240Z scale_ub: Optional[float], 2025-05-07T20:32:16.3727329Z contiguous: bool, 2025-05-07T20:32:16.3727413Z compiled: bool, 2025-05-07T20:32:16.3727486Z ) -> None: 2025-05-07T20:32:16.3727579Z torch.manual_seed(2025) 2025-05-07T20:32:16.3727649Z 2025-05-07T20:32:16.3727811Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3729606Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3729616Z 2025-05-07T20:32:16.3729729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3729863Z =============================== warnings summary =============================== 2025-05-07T20:32:16.3730167Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3730467Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3730762Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3731625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:16.3731858Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:16.3731863Z 2025-05-07T20:32:16.3732036Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:16.3733425Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:16.3733613Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:16.3733618Z 2025-05-07T20:32:16.3733825Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:16.3733991Z ================== 1 failed, 1 passed, 13 warnings in 20.32s =================== 2025-05-07T20:32:18.0898110Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:18.1516326Z 2025-05-07T20:32:18.1516700Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:18.1517056Z 2025-05-07T20:32:18.1517063Z 2025-05-07T20:32:18.1538826Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:20.3100239Z ============================= test session starts ============================== 2025-05-07T20:32:20.3100881Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:20.3101397Z cachedir: .pytest_cache 2025-05-07T20:32:20.3101960Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:20.3102691Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:20.3103094Z plugins: hypothesis-6.131.14 2025-05-07T20:32:21.9279186Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:22.0387083Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:22.0387491Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:22.0387716Z 2025-05-07T20:32:24.1160932Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.1162165Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:24.1163493Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.1164934Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.1165904Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1167191Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.1168557Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.1169532Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1170743Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.1172453Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.1173608Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1174868Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.1176275Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:24.1177484Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.1178666Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:24.1179486Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1180493Z W0507 
20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:24.1181675Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:24.1182453Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:24.1183648Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.1184910Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.1186006Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:24.1187043Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:24.1188195Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.1189529Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.1190581Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.1191479Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.1192209Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:24.1193205Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.1320853Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.1321904Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:24.1323218Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.1324738Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.1325694Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1326985Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.1328346Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.1329317Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1330527Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.1331884Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.1332937Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1334297Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.1335528Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:24.1336725Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.1337923Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:24.1338734Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1339743Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:24.1340750Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:24.1341531Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:24.1342714Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.1344073Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.1345175Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:24.1346208Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:24.1347451Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.1348778Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.1349831Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.1350735Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.1351506Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:24.1352518Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5513789Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5514427Z self=, 2025-05-07T20:32:24.5514840Z T=1, 2025-05-07T20:32:24.5515041Z D=5120, 2025-05-07T20:32:24.5515247Z scale_ub=None, 2025-05-07T20:32:24.5515470Z contiguous=True, 2025-05-07T20:32:24.5515701Z compiled=True, 2025-05-07T20:32:24.5515905Z ) 2025-05-07T20:32:24.5516231Z self = 2025-05-07T20:32:24.5516719Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.5516978Z 2025-05-07T20:32:24.5517064Z @given( 2025-05-07T20:32:24.5517308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5517646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5517956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5518289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5518612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5518899Z ) 2025-05-07T20:32:24.5519251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5519692Z def test_silu_mul_quant( 2025-05-07T20:32:24.5519933Z self, 2025-05-07T20:32:24.5520129Z T: int, 2025-05-07T20:32:24.5520349Z D: int, 2025-05-07T20:32:24.5520575Z scale_ub: Optional[float], 2025-05-07T20:32:24.5520853Z contiguous: bool, 2025-05-07T20:32:24.5521089Z compiled: bool, 2025-05-07T20:32:24.5521325Z ) -> None: 2025-05-07T20:32:24.5521546Z torch.manual_seed(2025) 2025-05-07T20:32:24.5521792Z 2025-05-07T20:32:24.5522074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5522417Z 2025-05-07T20:32:24.5522609Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5522908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5523224Z x = x_sign * x_clamp 2025-05-07T20:32:24.5523462Z x0 = x[:, :D] 2025-05-07T20:32:24.5523683Z x1 = x[:, D:] 2025-05-07T20:32:24.5523894Z 2025-05-07T20:32:24.5524080Z if contiguous: 2025-05-07T20:32:24.5524583Z x0 = x0.contiguous() 2025-05-07T20:32:24.5524851Z x1 = x1.contiguous() 2025-05-07T20:32:24.5525088Z 2025-05-07T20:32:24.5525284Z if scale_ub is not None: 2025-05-07T20:32:24.5525557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5525894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5526198Z ) 2025-05-07T20:32:24.5526392Z else: 2025-05-07T20:32:24.5526764Z scale_ub_tensor = None 2025-05-07T20:32:24.5527017Z 2025-05-07T20:32:24.5527253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5527567Z op = silu_mul_quant 2025-05-07T20:32:24.5527814Z if compiled: 2025-05-07T20:32:24.5528065Z op = torch.compile(op) 2025-05-07T20:32:24.5528364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5528635Z 2025-05-07T20:32:24.5528832Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.5529125Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.5529413Z 2025-05-07T20:32:24.5529654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5529990Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.5530280Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.5530594Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.5530953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5531271Z 2025-05-07T20:32:24.5531477Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.5531674Z 2025-05-07T20:32:24.5531779Z moe/activation_test.py:126: 2025-05-07T20:32:24.5532078Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5532408Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.5532736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5533592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.5534347Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.5534891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5535571Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5536268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.5536980Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.5537705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.5538341Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.5538945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.5539454Z fn() 2025-05-07T20:32:24.5539967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.5540550Z self.fn.run( 2025-05-07T20:32:24.5541024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5541548Z kernel = self.compile( 2025-05-07T20:32:24.5542097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5542751Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5543144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5543378Z 2025-05-07T20:32:24.5543588Z self = 2025-05-07T20:32:24.5544757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5546136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1098d4c20>} 2025-05-07T20:32:24.5547461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5548551Z context = 2025-05-07T20:32:24.5548841Z 2025-05-07T20:32:24.5549005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5549525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5549999Z module_map=module_map) 2025-05-07T20:32:24.5550361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5550726Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.5551001Z E ^ 2025-05-07T20:32:24.5551489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5551969Z 2025-05-07T20:32:24.5552386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5552896Z 2025-05-07T20:32:24.5553002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5553420Z self=, 2025-05-07T20:32:24.5553815Z T=2048, 2025-05-07T20:32:24.5554008Z D=5120, 2025-05-07T20:32:24.5554210Z scale_ub=1200.0, 2025-05-07T20:32:24.5554432Z contiguous=True, 2025-05-07T20:32:24.5554658Z compiled=False, 2025-05-07T20:32:24.5554869Z ) 2025-05-07T20:32:25.0007414Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.0009550Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:25.0012173Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.0013727Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.0014701Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0015990Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.0017356Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0018333Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0019536Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.0021265Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0022359Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0023798Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.0025026Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:25.0026233Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.0027418Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:25.0028232Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0029245Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.0030248Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:25.0031022Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.0032273Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.0033530Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.0034638Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.0035674Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:25.0036828Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.0038163Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.0039208Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0040104Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0040830Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:25.0041831Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.0901010Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.0902084Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:25.0903404Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.0904968Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.0905926Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0907231Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.0908591Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0909563Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0910785Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.0912132Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0913193Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0914460Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.0915695Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:25.0916896Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.0918084Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:25.0918901Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0919910Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.0920917Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:25.0921717Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.0923020Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.0924292Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.0925391Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.0926500Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:25.0927655Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.0928996Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.0930042Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0930936Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0931669Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:25.0932726Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5499311Z self = 2025-05-07T20:32:25.5499862Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.5500144Z 2025-05-07T20:32:25.5500236Z @given( 2025-05-07T20:32:25.5500497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.5500828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.5501149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.5501484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.5501820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.5502119Z ) 2025-05-07T20:32:25.5502485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.5502934Z def test_silu_mul_quant( 2025-05-07T20:32:25.5503191Z self, 2025-05-07T20:32:25.5503399Z T: int, 2025-05-07T20:32:25.5503601Z D: int, 2025-05-07T20:32:25.5503835Z scale_ub: Optional[float], 2025-05-07T20:32:25.5504114Z contiguous: bool, 2025-05-07T20:32:25.5504359Z compiled: bool, 2025-05-07T20:32:25.5504593Z ) -> None: 2025-05-07T20:32:25.5504823Z torch.manual_seed(2025) 2025-05-07T20:32:25.5505067Z 2025-05-07T20:32:25.5505353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.5505702Z 2025-05-07T20:32:25.5505900Z x_sign = torch.sign(x) 2025-05-07T20:32:25.5506206Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.5506523Z x = x_sign * x_clamp 2025-05-07T20:32:25.5506770Z x0 = x[:, :D] 2025-05-07T20:32:25.5506997Z x1 = x[:, D:] 2025-05-07T20:32:25.5507228Z 2025-05-07T20:32:25.5507422Z if contiguous: 2025-05-07T20:32:25.5507669Z x0 = x0.contiguous() 2025-05-07T20:32:25.5507937Z x1 = x1.contiguous() 2025-05-07T20:32:25.5508177Z 2025-05-07T20:32:25.5508386Z if scale_ub is not None: 2025-05-07T20:32:25.5508678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.5509022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.5509358Z ) 2025-05-07T20:32:25.5509562Z else: 2025-05-07T20:32:25.5510000Z scale_ub_tensor = None 2025-05-07T20:32:25.5517402Z 2025-05-07T20:32:25.5517660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5517978Z op = silu_mul_quant 2025-05-07T20:32:25.5518232Z if compiled: 2025-05-07T20:32:25.5518485Z op = torch.compile(op) 2025-05-07T20:32:25.5518779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5519247Z 2025-05-07T20:32:25.5519448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.5519612Z 2025-05-07T20:32:25.5519715Z moe/activation_test.py:117: 2025-05-07T20:32:25.5520022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5520359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.5520639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5521329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.5522074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.5522607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.5523279Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.5523946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.5524481Z kernel = self.compile( 2025-05-07T20:32:25.5525025Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.5525671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.5526068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5526294Z 2025-05-07T20:32:25.5526505Z self = 2025-05-07T20:32:25.5527576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.5528933Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb109990180>} 2025-05-07T20:32:25.5530261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.5531276Z context = 2025-05-07T20:32:25.5531561Z 2025-05-07T20:32:25.5531732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.5532241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.5532708Z module_map=module_map) 2025-05-07T20:32:25.5533163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.5533513Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.5533772Z E ^ 2025-05-07T20:32:25.5534234Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5534681Z 2025-05-07T20:32:25.5535097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.5535600Z 2025-05-07T20:32:25.5535708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.5536109Z self=, 2025-05-07T20:32:25.5536510Z T=2048, 2025-05-07T20:32:25.5536703Z D=5120, 2025-05-07T20:32:25.5536889Z scale_ub=1200.0, 2025-05-07T20:32:25.5537111Z contiguous=True, 2025-05-07T20:32:25.5537419Z compiled=True, 2025-05-07T20:32:25.5537621Z ) 2025-05-07T20:32:25.5537938Z self = 2025-05-07T20:32:25.5538433Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.5538698Z 2025-05-07T20:32:25.5538777Z @given( 2025-05-07T20:32:25.5539010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.5539394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.5539705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.5540026Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.5540353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.5540640Z ) 2025-05-07T20:32:25.5540981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.5541423Z def test_silu_mul_quant( 2025-05-07T20:32:25.5541669Z self, 2025-05-07T20:32:25.5541864Z T: int, 2025-05-07T20:32:25.5542067Z D: int, 2025-05-07T20:32:25.5542289Z scale_ub: Optional[float], 2025-05-07T20:32:25.5542553Z contiguous: bool, 2025-05-07T20:32:25.5542792Z compiled: bool, 2025-05-07T20:32:25.5543014Z ) -> None: 2025-05-07T20:32:25.5543225Z torch.manual_seed(2025) 2025-05-07T20:32:25.5543474Z 2025-05-07T20:32:25.5543750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.5544101Z 2025-05-07T20:32:25.5544290Z x_sign = torch.sign(x) 2025-05-07T20:32:25.5544587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.5544903Z x = x_sign * x_clamp 2025-05-07T20:32:25.5545142Z x0 = x[:, :D] 
2025-05-07T20:32:25.5545369Z x1 = x[:, D:] 2025-05-07T20:32:25.5545585Z 2025-05-07T20:32:25.5545766Z if contiguous: 2025-05-07T20:32:25.5546005Z x0 = x0.contiguous() 2025-05-07T20:32:25.5546263Z x1 = x1.contiguous() 2025-05-07T20:32:25.5546506Z 2025-05-07T20:32:25.5546699Z if scale_ub is not None: 2025-05-07T20:32:25.5546979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.5547308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.5547619Z ) 2025-05-07T20:32:25.5547814Z else: 2025-05-07T20:32:25.5548019Z scale_ub_tensor = None 2025-05-07T20:32:25.5548270Z 2025-05-07T20:32:25.5548510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5548817Z op = silu_mul_quant 2025-05-07T20:32:25.5549072Z if compiled: 2025-05-07T20:32:25.5549321Z op = torch.compile(op) 2025-05-07T20:32:25.5549620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5549890Z 2025-05-07T20:32:25.5550089Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.5550381Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.5550660Z 2025-05-07T20:32:25.5550905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5551242Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.5551526Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.5551844Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.5552208Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.5552514Z 2025-05-07T20:32:25.5552720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.5552917Z 2025-05-07T20:32:25.5553015Z moe/activation_test.py:126: 2025-05-07T20:32:25.5553318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5553650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.5553987Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.5554772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.5555595Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.5556143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.5556820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.5557499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.5558279Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.5559000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.5560021Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.5560624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.5561126Z fn() 2025-05-07T20:32:25.5561639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.5562220Z self.fn.run( 2025-05-07T20:32:25.5562730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.5563264Z kernel = self.compile( 2025-05-07T20:32:25.5563802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.5564456Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.5564845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5565076Z 2025-05-07T20:32:25.5565281Z self = 2025-05-07T20:32:25.5566358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.5567713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852d260>} 2025-05-07T20:32:25.5569035Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.5570059Z context = 2025-05-07T20:32:25.5570350Z 2025-05-07T20:32:25.5570514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.5571034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.5571493Z module_map=module_map) 2025-05-07T20:32:25.5571862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.5572220Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.5572482Z E ^ 2025-05-07T20:32:25.5572949Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5573480Z 2025-05-07T20:32:25.5573891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.5574406Z 2025-05-07T20:32:25.5574516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.5574923Z self=, 2025-05-07T20:32:25.5575323Z T=16384, 2025-05-07T20:32:25.5575525Z D=7168, 2025-05-07T20:32:25.5575721Z scale_ub=1200.0, 2025-05-07T20:32:25.5575940Z contiguous=False, 2025-05-07T20:32:25.5576168Z compiled=False, 2025-05-07T20:32:25.5576372Z ) 2025-05-07T20:32:25.8020796Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.8021919Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:25.8023263Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.8024928Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.8025890Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8027181Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.8028548Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.8029528Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8030739Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.8032100Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.8033155Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8034416Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.8035651Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:25.8036859Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.8038048Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:25.8038863Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8039868Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.8040882Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:25.8041669Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.8042937Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.8044209Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.8045311Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.8046409Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:25.8047556Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.8048900Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.8049951Z 
W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.8050848Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.8051584Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:32:25.8052579Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.8637799Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:25.8639113Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:32:25.8640421Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:25.8641833Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:25.8642849Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8644134Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:25.8645488Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.8646455Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8647664Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:25.8649017Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.8650273Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8651531Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:25.8652897Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     generator.visit(fn.parse())
2025-05-07T20:32:25.8654172Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:25.8655372Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ret = super().visit(node)
2025-05-07T20:32:25.8656185Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8657194Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:25.8658193Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return visitor(node)
2025-05-07T20:32:25.8658978Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]            ^^^^^^^^^^^^^
2025-05-07T20:32:25.8660329Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:25.8661591Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:25.8662689Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:25.8663711Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     self.visit(item)
2025-05-07T20:32:25.8664875Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:25.8666202Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:25.8667256Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.8668152Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.8668887Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:32:25.8669889Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3637827Z self =
2025-05-07T20:32:26.3638346Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:26.3638629Z
2025-05-07T20:32:26.3638716Z @given(
2025-05-07T20:32:26.3638980Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:26.3639466Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:26.3639792Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:26.3640129Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:26.3640449Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:26.3640733Z )
2025-05-07T20:32:26.3641085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:26.3641639Z def test_silu_mul_quant(
2025-05-07T20:32:26.3641882Z     self,
2025-05-07T20:32:26.3642111Z     T: int,
2025-05-07T20:32:26.3642323Z     D: int,
2025-05-07T20:32:26.3642545Z     scale_ub: Optional[float],
2025-05-07T20:32:26.3642820Z     contiguous: bool,
2025-05-07T20:32:26.3643063Z     compiled: bool,
2025-05-07T20:32:26.3643280Z ) -> None:
2025-05-07T20:32:26.3643499Z     torch.manual_seed(2025)
2025-05-07T20:32:26.3643743Z
2025-05-07T20:32:26.3644021Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:26.3644362Z
2025-05-07T20:32:26.3644557Z     x_sign = torch.sign(x)
2025-05-07T20:32:26.3644841Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:26.3645150Z     x = x_sign * x_clamp
2025-05-07T20:32:26.3645400Z     x0 = x[:, :D]
2025-05-07T20:32:26.3645611Z     x1 = x[:, D:]
2025-05-07T20:32:26.3645830Z
2025-05-07T20:32:26.3646031Z     if contiguous:
2025-05-07T20:32:26.3646265Z         x0 = x0.contiguous()
2025-05-07T20:32:26.3646524Z         x1 = x1.contiguous()
2025-05-07T20:32:26.3646769Z
2025-05-07T20:32:26.3646959Z     if scale_ub is not None:
2025-05-07T20:32:26.3647240Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:26.3647582Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:26.3647895Z         )
2025-05-07T20:32:26.3648085Z     else:
2025-05-07T20:32:26.3648300Z         scale_ub_tensor = None
2025-05-07T20:32:26.3648555Z
2025-05-07T20:32:26.3648796Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:26.3649113Z         op = silu_mul_quant
2025-05-07T20:32:26.3649365Z         if compiled:
2025-05-07T20:32:26.3649610Z             op = torch.compile(op)
2025-05-07T20:32:26.3649907Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:26.3650180Z
2025-05-07T20:32:26.3650371Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:26.3650550Z
2025-05-07T20:32:26.3650648Z moe/activation_test.py:117:
2025-05-07T20:32:26.3650944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3651264Z moe/activation_test.py:115: in fn
2025-05-07T20:32:26.3651545Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:26.3652232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:26.3652913Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:26.3653534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:26.3654210Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:26.3654868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:26.3655393Z     kernel = self.compile(
2025-05-07T20:32:26.3655929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:26.3656575Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:26.3656976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3657200Z
2025-05-07T20:32:26.3657406Z self =
2025-05-07T20:32:26.3658561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:26.3660067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852f060>}
2025-05-07T20:32:26.3661390Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:26.3662566Z context =
2025-05-07T20:32:26.3662847Z
2025-05-07T20:32:26.3663014Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:26.3663531Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:26.3663993Z                            module_map=module_map)
2025-05-07T20:32:26.3664356Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:26.3664712Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:26.3664974Z E       ^
2025-05-07T20:32:26.3665440Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3665887Z
2025-05-07T20:32:26.3666299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:26.3666815Z
2025-05-07T20:32:26.3666921Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:26.3667337Z     self=,
2025-05-07T20:32:26.3667743Z     T=1,
2025-05-07T20:32:26.3667929Z     D=7168,
2025-05-07T20:32:26.3668127Z     scale_ub=None,
2025-05-07T20:32:26.3668343Z     contiguous=True,
2025-05-07T20:32:26.3668563Z     compiled=True,
2025-05-07T20:32:26.3668779Z )
2025-05-07T20:32:26.3669105Z self =
2025-05-07T20:32:26.3669612Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:26.3669868Z
2025-05-07T20:32:26.3681567Z     y_fp8, y_scale = fn()
2025-05-07T20:32:26.3681847Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:26.3682139Z
2025-05-07T20:32:26.3682388Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:26.3682733Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:26.3683024Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:26.3683345Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:26.3683708Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:26.3684015Z
2025-05-07T20:32:26.3684227Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:26.3684421Z
2025-05-07T20:32:26.3684529Z moe/activation_test.py:126:
2025-05-07T20:32:26.3684829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3685167Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:26.3685499Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:26.3686290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:26.3687035Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:26.3687591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:26.3688268Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:26.3688952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:26.3689662Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:26.3690390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:26.3691025Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:26.3691615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:26.3692142Z     fn()
2025-05-07T20:32:26.3692809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:26.3693455Z     self.fn.run(
2025-05-07T20:32:26.3693920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:26.3694461Z     kernel = self.compile(
2025-05-07T20:32:26.3695003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:26.3695654Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:26.3696058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3696288Z
2025-05-07T20:32:26.3696495Z self =
2025-05-07T20:32:26.3697564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:26.3699053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb103820fe0>}
2025-05-07T20:32:26.3700378Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:26.3701466Z context =
2025-05-07T20:32:26.3701752Z
2025-05-07T20:32:26.3701924Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:26.3702439Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:26.3702901Z                            module_map=module_map)
2025-05-07T20:32:26.3703372Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:26.3703796Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:26.3711251Z E       ^
2025-05-07T20:32:26.3711738Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3712184Z
2025-05-07T20:32:26.3712609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:26.3713117Z
2025-05-07T20:32:26.3713225Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:26.3713644Z     self=,
2025-05-07T20:32:26.3714041Z     T=4096,
2025-05-07T20:32:26.3714235Z     D=5120,
2025-05-07T20:32:26.3714426Z     scale_ub=None,
2025-05-07T20:32:26.3714643Z     contiguous=False,
2025-05-07T20:32:26.3714870Z     compiled=False,
2025-05-07T20:32:26.3715066Z )
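Note on the failures above: fp8e4nv is Triton's name for the float8_e4m3fn format, which the NVIDIA backend only lowers on GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G in a linux.g5.4xlarge.nvidia.gpu runner is SM 8.6, so every kernel that casts to tl.float8e4nv fails inside make_ir no matter which Hypothesis parameters are drawn. A minimal guard one might add to such a test is sketched below; require_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite.

import pytest
import torch

def require_fp8e4nv() -> None:
    # Hypothetical helper: skip FP8-e4m3 tests on GPUs older than SM 8.9,
    # where Triton cannot lower tl.float8e4nv and raises the ValueError above.
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip(f"fp8e4nv (float8_e4m3fn) needs SM 8.9+, got SM {major}.{minor}")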
2025-05-07T20:32:27.6411544Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.6411964Z     self=,
2025-05-07T20:32:27.6412364Z     T=4096,
2025-05-07T20:32:27.6412564Z     D=7168,
2025-05-07T20:32:27.6412766Z     scale_ub=None,
2025-05-07T20:32:27.6413071Z     contiguous=False,
2025-05-07T20:32:27.6413339Z     compiled=False,
2025-05-07T20:32:27.6413556Z )
2025-05-07T20:32:27.6443224Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.6443645Z     self=,
2025-05-07T20:32:27.6444058Z     T=128,
2025-05-07T20:32:27.6444261Z     D=7168,
2025-05-07T20:32:27.6444468Z     scale_ub=None,
2025-05-07T20:32:27.6444693Z     contiguous=False,
2025-05-07T20:32:27.6444925Z     compiled=True,
2025-05-07T20:32:27.6445133Z )
2025-05-07T20:32:27.7046344Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.7046753Z     self=,
2025-05-07T20:32:27.7047149Z     T=128,
2025-05-07T20:32:27.7047346Z     D=7168,
2025-05-07T20:32:27.7047545Z     scale_ub=None,
2025-05-07T20:32:27.7047764Z     contiguous=False,
2025-05-07T20:32:27.7048004Z     compiled=False,
2025-05-07T20:32:27.7048217Z )
2025-05-07T20:32:27.9031598Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.9032007Z     self=,
2025-05-07T20:32:27.9032406Z     T=4096,
2025-05-07T20:32:27.9032595Z     D=5120,
2025-05-07T20:32:27.9032810Z     scale_ub=1200.0,
2025-05-07T20:32:27.9033057Z     contiguous=True,
2025-05-07T20:32:27.9033282Z     compiled=False,
2025-05-07T20:32:27.9033482Z )
2025-05-07T20:32:27.9062419Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.9062830Z     self=,
2025-05-07T20:32:27.9063229Z     T=1,
2025-05-07T20:32:27.9063412Z     D=5120,
2025-05-07T20:32:27.9063606Z     scale_ub=None,
2025-05-07T20:32:27.9063818Z     contiguous=True,
2025-05-07T20:32:27.9064045Z     compiled=True,
2025-05-07T20:32:27.9064246Z )
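Each example above fails before any kernel executes: the error is raised while Triton builds IR for the fp8 cast. A standalone sketch that would reproduce the same CompilationError on a pre-SM-8.9 GPU, assuming a CUDA device and a recent Triton (hypothetical reproducer, not code from this repository):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv below is what trips make_ir on SM < 8.9 with
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

n = 1024
x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (SM 8.6) this launch raises triton.compiler.errors.CompilationError;
# on SM 8.9+ (e.g. L4/H100) it compiles and runs.
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)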
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:28.1403462Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:28.1404344Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:28.1405354Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:28.1406365Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:28.1407162Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:28.1408366Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:28.1409644Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:28.1410746Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:28.1411783Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:28.1412955Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:28.1414407Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:28.1415553Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1416664Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1417557Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:28.1418809Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1038b2840>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
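This failure, and every retry below, has a single root cause: Triton's fp8e4nv is float8_e4m3fn, which Triton's NVIDIA backend only supports on compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard that would skip these tests on such runners (the helper and class body are illustrative, not FBGEMM's actual test scaffolding):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv / float8_e4m3fn needs SM >= 8.9; the A10G here reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        def test_silu_mul_quant(self) -> None:
            ...  # fp8 kernels may assume e4m3 support past the guard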
Trying example: test_silu_mul_quant(
self=,
T=2048,
D=5120,
scale_ub=None,
contiguous=True,
compiled=True,
)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
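For orientation, ref_fn in the failing test is SiLU gating followed by row-wise fp8 quantization: y = x0 * sigmoid(x0) * x1, then one scale per row. A pure-PyTorch sketch of what triton_quantize_fp8_row appears to compute, judging from the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (the e4m3fn target and the scale_ub clamping are assumptions, not FBGEMM's exact kernel semantics):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # one max per row
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the scale if given
        y_scale = row_max / fp8_max                      # dequantization scale per row
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale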
Trying example: test_silu_mul_quant(
self=,
T=128,
D=5120,
scale_ub=None,
contiguous=True,
compiled=True,
)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
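The repeated "Trying example" blocks are Hypothesis' Verbosity.verbose trial log; every drawn example dies on the same CompilationError, so the run cannot surface a smaller, distinct counterexample. When debugging locally, a pinned example replays one failing draw deterministically (a sketch; the test body is elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=5120)  # the smallest failing trial seen in this log
    @settings(deadline=None, max_examples=5)
    def test_replay(T: int, D: int) -> None:
        ...  # body as in test_silu_mul_quant above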
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7439034Z 2025-05-07T20:32:29.7439445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7439956Z 2025-05-07T20:32:29.7440071Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7440484Z self=, 2025-05-07T20:32:29.7440889Z T=4096, 2025-05-07T20:32:29.7441077Z D=5120, 2025-05-07T20:32:29.7441279Z scale_ub=None, 2025-05-07T20:32:29.7441500Z contiguous=True, 2025-05-07T20:32:29.7441721Z compiled=True, 2025-05-07T20:32:29.7441927Z ) 2025-05-07T20:32:29.9813918Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:29.9814987Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:29.9816304Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:29.9817708Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:29.9818664Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9819949Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:29.9821304Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9822264Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9823471Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:29.9824821Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.9825875Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9827602Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:29.9828835Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:29.9830035Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:29.9831338Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:29.9832156Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9833158Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:29.9834157Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:29.9834937Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:29.9836130Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:29.9837391Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:29.9838490Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:29.9839508Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:29.9840667Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:29.9842001Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:29.9843047Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.9843985Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.9844720Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:29.9845725Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0513005Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:30.0514115Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:30.0515425Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:30.0516972Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:30.0517941Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0519221Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:30.0520692Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0521662Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0523032Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:30.0524390Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0525444Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0526710Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:30.0527939Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:30.0529141Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:30.0530332Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:30.0531151Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0532176Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:30.0533256Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:30.0534090Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:30.0535283Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:30.0536556Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:30.0537666Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:30.0538694Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:30.0539939Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:30.0541281Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:30.0542404Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0543314Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.0544090Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:30.0545105Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102b0ad40>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
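Note where this throws: inside src.make_ir, i.e. during AST-to-TTIR lowering, before the kernel ever launches, so the autotuner and torch.compile are bystanders. Assuming standard Triton APIs, a standalone kernel along these lines should reproduce the same CompilationError on this hardware (a sketch, not code from this repo):

    # Sketch: any kernel that materializes tl.float8e4nv should hit the
    # same error on SM < 8.9.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below forces the fp8e4nv type into the generated TTIR.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM 8.6 (A10G) this raises the CompilationError wrapping the
    # ValueError above; on SM 8.9+ it compiles and runs.
    _cast_to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)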
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

[... test body and traceback identical to the display above elided; this example fails the same way at `y_fp8_ref, y_scale_ref = ref_fn()` -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row ...]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
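The recompile-limit warning above is a side effect of the test's contiguous parameter: x0 = x[:, :D] is a view whose row stride stays 2 * D = 10240, while x0.contiguous() has row stride D = 5120, and torch.compile guards on strides, so alternating examples keep invalidating the guard until config.recompile_limit (8) trips and dynamo falls back. A small sketch of the stride arithmetic, plus one possible mitigation (an assumption, not taken from this log):

    # Stride arithmetic behind the guard failure (runnable on CPU):
    import torch

    D = 5120
    x = torch.randn([4, 2 * D], dtype=torch.bfloat16)
    x0_view = x[:, :D]                # a view: row stride stays 2 * D
    x0_cont = x[:, :D].contiguous()   # a copy: row stride becomes D
    assert x0_view.stride(0) == 10240   # "actual 10240" in the warning
    assert x0_cont.stride(0) == 5120    # "expected 5120" in the warning

    # Possible mitigation sketch: raise the limit named by convert_frame.py
    # before compiling, at the cost of more recompiles.
    import torch._dynamo
    torch._dynamo.config.recompile_limit = 64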
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... identical test body and traceback elided; fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant, same CompilationError ...]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

[... identical test body and traceback elided; fails at `y_fp8_ref, y_scale_ref = ref_fn()` -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError ...]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)

[... identical test body and traceback elided; fails at `y_fp8, y_scale = fn()` -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant, this time with no torch._dynamo frame since compiled=False ...]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
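For reference, ref_fn computes SiLU(x0) * x1 in fp32 and then quantizes row-wise to fp8. The eager sketch below mirrors that shape of computation under assumed semantics for the row-wise quantize (per-row max-abs scale, optional scale_ub clamp); the actual triton_quantize_fp8_row implementation in fbgemm_gpu is not shown in this log, so treat the details as illustrative only:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, clamped away from zero to avoid divide-by-zero.
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max / FP8_MAX                     # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(128, 5120).float()
    x1 = torch.randn(128, 5120).float()
    y = x0 * torch.sigmoid(x0) * x1                   # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = quantize_fp8_row_ref(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # dequant, as in the test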
2025-05-07T20:32:30.8790096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8790782Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8791306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8791972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8792622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8793146Z kernel = self.compile( 2025-05-07T20:32:30.8793791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8794595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8795081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8795360Z 2025-05-07T20:32:30.8795641Z self = 2025-05-07T20:32:30.8796784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8798121Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201d1c0>} 2025-05-07T20:32:30.8799432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8800442Z context = 2025-05-07T20:32:30.8800723Z 2025-05-07T20:32:30.8800888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8801458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8801990Z module_map=module_map) 2025-05-07T20:32:30.8802353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8802701Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8802950Z E ^ 2025-05-07T20:32:30.8803447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8803996Z 2025-05-07T20:32:30.8804558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8805185Z 2025-05-07T20:32:30.8805317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8805778Z self=, 2025-05-07T20:32:30.8806171Z T=128, 2025-05-07T20:32:30.8806360Z D=7168, 2025-05-07T20:32:30.8806554Z scale_ub=1200.0, 2025-05-07T20:32:30.8806787Z contiguous=False, 2025-05-07T20:32:30.8807012Z compiled=False, 2025-05-07T20:32:30.8807222Z ) 2025-05-07T20:32:30.9939543Z self = 2025-05-07T20:32:30.9940147Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.9940540Z 2025-05-07T20:32:30.9940657Z @given( 2025-05-07T20:32:30.9940912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9941232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9941567Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9941909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9942243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9942545Z ) 2025-05-07T20:32:30.9942906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9943365Z def test_silu_mul_quant( 2025-05-07T20:32:30.9943614Z self, 2025-05-07T20:32:30.9943819Z T: int, 2025-05-07T20:32:30.9944033Z D: int, 2025-05-07T20:32:30.9944260Z scale_ub: Optional[float], 2025-05-07T20:32:30.9944542Z contiguous: bool, 2025-05-07T20:32:30.9944793Z compiled: bool, 2025-05-07T20:32:30.9945027Z ) -> None: 2025-05-07T20:32:30.9945255Z torch.manual_seed(2025) 2025-05-07T20:32:30.9945512Z 2025-05-07T20:32:30.9945790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9946149Z 2025-05-07T20:32:30.9946360Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9946653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9946978Z x = x_sign * x_clamp 2025-05-07T20:32:30.9947230Z x0 = x[:, :D] 2025-05-07T20:32:30.9947452Z x1 = x[:, D:] 2025-05-07T20:32:30.9947673Z 2025-05-07T20:32:30.9947871Z if contiguous: 2025-05-07T20:32:30.9948115Z x0 = x0.contiguous() 2025-05-07T20:32:30.9948388Z x1 = x1.contiguous() 2025-05-07T20:32:30.9948638Z 2025-05-07T20:32:30.9948840Z if scale_ub is not None: 2025-05-07T20:32:30.9949121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9949462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9949777Z ) 2025-05-07T20:32:30.9949975Z else: 2025-05-07T20:32:30.9950199Z scale_ub_tensor = None 2025-05-07T20:32:30.9950461Z 2025-05-07T20:32:30.9950700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9951024Z op = silu_mul_quant 2025-05-07T20:32:30.9951290Z if compiled: 2025-05-07T20:32:30.9951545Z op = torch.compile(op) 2025-05-07T20:32:30.9951856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9952136Z 2025-05-07T20:32:30.9952334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9952711Z 2025-05-07T20:32:30.9952815Z moe/activation_test.py:117: 2025-05-07T20:32:30.9953122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9953600Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9953889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9954587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9955276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9955811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9956636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9957296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9957833Z kernel = self.compile( 2025-05-07T20:32:30.9958389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9959050Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9959643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9959877Z 2025-05-07T20:32:30.9960090Z self = 2025-05-07T20:32:30.9961164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9962526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201cd60>} 2025-05-07T20:32:30.9963868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9964929Z context = 2025-05-07T20:32:30.9965231Z 2025-05-07T20:32:30.9965403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9965935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9966406Z module_map=module_map) 2025-05-07T20:32:30.9966787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9967154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9967429Z E ^ 2025-05-07T20:32:30.9967897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9968352Z 2025-05-07T20:32:30.9968770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.9969281Z 2025-05-07T20:32:30.9969393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.9969821Z self=, 2025-05-07T20:32:30.9970225Z T=128, 2025-05-07T20:32:30.9970418Z D=5120, 2025-05-07T20:32:30.9970626Z scale_ub=None, 2025-05-07T20:32:30.9970842Z contiguous=False, 2025-05-07T20:32:30.9971081Z compiled=False, 2025-05-07T20:32:30.9971298Z ) 2025-05-07T20:32:30.9971620Z self = 2025-05-07T20:32:30.9972122Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.9972393Z 2025-05-07T20:32:30.9972479Z @given( 2025-05-07T20:32:30.9972721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9973112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9973447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9973871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9974358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9974686Z ) 2025-05-07T20:32:30.9975056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9975508Z def test_silu_mul_quant( 2025-05-07T20:32:30.9975772Z self, 2025-05-07T20:32:30.9975996Z T: int, 2025-05-07T20:32:30.9976211Z D: int, 2025-05-07T20:32:30.9976450Z scale_ub: Optional[float], 2025-05-07T20:32:30.9976804Z contiguous: bool, 2025-05-07T20:32:30.9977061Z compiled: bool, 2025-05-07T20:32:30.9977305Z ) -> None: 2025-05-07T20:32:30.9977542Z torch.manual_seed(2025) 2025-05-07T20:32:30.9977799Z 2025-05-07T20:32:30.9978090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9978447Z 2025-05-07T20:32:30.9978656Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9978971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9979308Z x = x_sign * x_clamp 2025-05-07T20:32:30.9979575Z x0 = x[:, :D] 2025-05-07T20:32:30.9979812Z x1 = x[:, D:] 2025-05-07T20:32:30.9980045Z 2025-05-07T20:32:30.9980262Z if contiguous: 2025-05-07T20:32:30.9980515Z x0 = x0.contiguous() 2025-05-07T20:32:30.9980796Z x1 = x1.contiguous() 2025-05-07T20:32:30.9981061Z 2025-05-07T20:32:30.9981276Z if scale_ub is not None: 2025-05-07T20:32:30.9981570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9981927Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9982249Z ) 2025-05-07T20:32:30.9982461Z else: 2025-05-07T20:32:30.9982694Z scale_ub_tensor = None 2025-05-07T20:32:30.9982957Z 2025-05-07T20:32:30.9983215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9983548Z op = silu_mul_quant 2025-05-07T20:32:30.9983816Z if compiled: 2025-05-07T20:32:30.9984127Z op = torch.compile(op) 2025-05-07T20:32:30.9984455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9984760Z 2025-05-07T20:32:30.9984968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9985147Z 2025-05-07T20:32:30.9985254Z moe/activation_test.py:117: 2025-05-07T20:32:30.9985568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9985907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9986215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9986912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9987617Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9988170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9988869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9989541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9990080Z kernel = self.compile( 2025-05-07T20:32:30.9990637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9991300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9991707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9991939Z 2025-05-07T20:32:30.9992154Z self = 2025-05-07T20:32:30.9993227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9994717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201e2a0>} 2025-05-07T20:32:30.9996058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9997076Z context = 2025-05-07T20:32:30.9997430Z 2025-05-07T20:32:30.9997600Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9998137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9998614Z module_map=module_map) 2025-05-07T20:32:30.9998992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9999369Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9999652Z E ^ 2025-05-07T20:32:31.0000152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.0000609Z 2025-05-07T20:32:31.0001027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.0001555Z 2025-05-07T20:32:31.0001668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.0002097Z self=, 2025-05-07T20:32:31.0002526Z T=128, 2025-05-07T20:32:31.0002726Z D=5120, 2025-05-07T20:32:31.0002943Z scale_ub=1200.0, 2025-05-07T20:32:31.0003192Z contiguous=True, 2025-05-07T20:32:31.0003431Z compiled=False, 2025-05-07T20:32:31.0003662Z ) 2025-05-07T20:32:31.1751813Z self = 2025-05-07T20:32:31.1752398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.1752696Z 2025-05-07T20:32:31.1752786Z @given( 2025-05-07T20:32:31.1753051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1753383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1753703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1754046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1754401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1754746Z ) 2025-05-07T20:32:31.1755110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1755572Z def test_silu_mul_quant( 2025-05-07T20:32:31.1755830Z self, 2025-05-07T20:32:31.1756032Z T: int, 2025-05-07T20:32:31.1756246Z D: int, 2025-05-07T20:32:31.1756476Z scale_ub: Optional[float], 2025-05-07T20:32:31.1756758Z contiguous: bool, 2025-05-07T20:32:31.1757012Z compiled: bool, 2025-05-07T20:32:31.1757251Z ) -> None: 2025-05-07T20:32:31.1757473Z torch.manual_seed(2025) 2025-05-07T20:32:31.1757727Z 2025-05-07T20:32:31.1758020Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1758367Z 2025-05-07T20:32:31.1758573Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1758882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1759684Z x = x_sign * x_clamp 2025-05-07T20:32:31.1759947Z x0 = x[:, :D] 2025-05-07T20:32:31.1760171Z x1 = x[:, D:] 2025-05-07T20:32:31.1760382Z 2025-05-07T20:32:31.1760584Z if contiguous: 2025-05-07T20:32:31.1760830Z x0 = x0.contiguous() 2025-05-07T20:32:31.1761107Z x1 = x1.contiguous() 2025-05-07T20:32:31.1761349Z 2025-05-07T20:32:31.1761554Z if scale_ub is not None: 2025-05-07T20:32:31.1761837Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.1762175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.1762656Z ) 2025-05-07T20:32:31.1762864Z else: 2025-05-07T20:32:31.1763238Z scale_ub_tensor = None 2025-05-07T20:32:31.1763501Z 2025-05-07T20:32:31.1763752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.1764070Z op = silu_mul_quant 2025-05-07T20:32:31.1764327Z if compiled: 2025-05-07T20:32:31.1764591Z op = torch.compile(op) 2025-05-07T20:32:31.1764930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1765223Z 2025-05-07T20:32:31.1765504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.1765670Z 2025-05-07T20:32:31.1765771Z moe/activation_test.py:117: 2025-05-07T20:32:31.1766088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1766430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.1766723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1767418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.1768194Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.1768796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.1780010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.1780703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.1781251Z kernel = self.compile( 2025-05-07T20:32:31.1781794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.1782450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.1782853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1783085Z 2025-05-07T20:32:31.1783292Z self = 2025-05-07T20:32:31.1784375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.1785746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201f380>} 2025-05-07T20:32:31.1787079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.1788097Z context = 2025-05-07T20:32:31.1788383Z 2025-05-07T20:32:31.1788561Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.1789077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.1789553Z module_map=module_map) 2025-05-07T20:32:31.1789930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1790282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1790550Z E ^ 2025-05-07T20:32:31.1791021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1791469Z 2025-05-07T20:32:31.1791887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1792395Z 2025-05-07T20:32:31.1792500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1792916Z self=, 2025-05-07T20:32:31.1793320Z T=1, 2025-05-07T20:32:31.1793601Z D=7168, 2025-05-07T20:32:31.1793804Z scale_ub=1200.0, 2025-05-07T20:32:31.1794036Z contiguous=True, 2025-05-07T20:32:31.1794259Z compiled=True, 2025-05-07T20:32:31.1794616Z ) 2025-05-07T20:32:31.1794944Z self = 2025-05-07T20:32:31.1795429Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:31.1822354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1822715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1822981Z E ^ 2025-05-07T20:32:31.1823441Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1823890Z 2025-05-07T20:32:31.1824359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
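Every example Hypothesis tries fails the same way: Triton rejects the fp8e4nv (e4m3) element type while lowering _fbgemm_silu_mul_quant, regardless of T, D, scale_ub, contiguous, or compiled. The error names the only fp8 encodings this GPU exposes, fp8e4b15 and fp8e5, which points at the hardware rather than the test. A minimal sketch of a capability guard that would skip these tests on such runners; the helper name supports_fp8e4nv is hypothetical, and treating compute capability (8, 9) as the e4m3 cutoff is an assumption consistent with the A10G (8, 6) in a g5.4xlarge hitting this error:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton only lowers it on
        # NVIDIA GPUs with compute capability >= (8, 9) (Ada/Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard on the failing test:
    #   @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    #   def test_silu_mul_quant(self, ...) -> None: ...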
2025-05-07T20:32:31.1824874Z 2025-05-07T20:32:31.1824982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1825409Z self=, 2025-05-07T20:32:31.1825806Z T=1, 2025-05-07T20:32:31.1826000Z D=7168, 2025-05-07T20:32:31.1826203Z scale_ub=1200.0, 2025-05-07T20:32:31.1826428Z contiguous=False, 2025-05-07T20:32:31.1826664Z compiled=True, 2025-05-07T20:32:31.1826881Z ) 2025-05-07T20:32:31.3144852Z self = 2025-05-07T20:32:31.3145400Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.3173413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.3173777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.3174044Z E ^ 2025-05-07T20:32:31.3174508Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.3175027Z 2025-05-07T20:32:31.3175442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
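For orientation, the math the fused _fbgemm_silu_mul_quant kernel feeds into quantization is spelled out by the test's reference path (ref_fn, shown in the next example below): upcast both halves to fp32, apply SiLU to the first, and multiply by the second. An eager-mode sketch of just that step, as illustration rather than the FBGEMM kernel itself:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Mirrors ref_fn in moe/activation_test.py: SiLU(x0) * x1,
        # computed in fp32, i.e. x0 * sigmoid(x0) * x1.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32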
2025-05-07T20:32:31.3175947Z 2025-05-07T20:32:31.3176064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.3176487Z self=, 2025-05-07T20:32:31.3176889Z T=1, 2025-05-07T20:32:31.3177089Z D=7168, 2025-05-07T20:32:31.3177295Z scale_ub=None, 2025-05-07T20:32:31.3177517Z contiguous=False, 2025-05-07T20:32:31.3177751Z compiled=True, 2025-05-07T20:32:31.3177970Z ) 2025-05-07T20:32:31.5845861Z self = 2025-05-07T20:32:31.5846915Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:31.5862785Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.5863066Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.5863359Z 2025-05-07T20:32:31.5863752Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.5864092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.5864384Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.5864698Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.5865061Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.5865367Z 2025-05-07T20:32:31.5865668Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.5865862Z 2025-05-07T20:32:31.5865970Z moe/activation_test.py:126: 2025-05-07T20:32:31.5866266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5866602Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.5866935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.5867739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.5868485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.5869034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:31.5869719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5870399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.5871120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.5871843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.5872478Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.5873076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.5873586Z fn() 2025-05-07T20:32:31.5874099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.5874672Z self.fn.run( 2025-05-07T20:32:31.5875139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5875668Z kernel = self.compile( 2025-05-07T20:32:31.5876218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5876870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5877275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5877510Z 2025-05-07T20:32:31.5877723Z self = 2025-05-07T20:32:31.5878799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5880168Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0173c6de0>} 2025-05-07T20:32:31.5881494Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5882519Z context = 2025-05-07T20:32:31.5882807Z 2025-05-07T20:32:31.5882984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5883511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5884049Z module_map=module_map) 2025-05-07T20:32:31.5884560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5884933Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.5885204Z E ^ 2025-05-07T20:32:31.5885671Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5886119Z 2025-05-07T20:32:31.5886538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
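This example (scale_ub=None, compiled=True) gets one step further: fn() returns, and the failure moves into the reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row and trips over the same fp8e4nv lowering. A back-of-the-envelope sketch of what row-wise fp8 quantization computes, assuming e4m3 with a finite max of 448; the scale_ub clamp and the epsilon are assumptions, not the real kernel's exact behavior:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude fills the fp8 range.
        row_amax = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        scale = row_amax.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize the way the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale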
2025-05-07T20:32:31.5887086Z 2025-05-07T20:32:31.5887204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5887621Z self=, 2025-05-07T20:32:31.5888029Z T=1, 2025-05-07T20:32:31.5888223Z D=5120, 2025-05-07T20:32:31.5888421Z scale_ub=1200.0, 2025-05-07T20:32:31.5888661Z contiguous=False, 2025-05-07T20:32:31.5888894Z compiled=True, 2025-05-07T20:32:31.5889105Z ) 2025-05-07T20:32:31.7421386Z self = 2025-05-07T20:32:31.7421936Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.7450374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7450834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7451178Z E ^ 2025-05-07T20:32:31.7451799Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7452318Z 2025-05-07T20:32:31.7452772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.7453431Z 2025-05-07T20:32:31.7453598Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.7454212Z self=, 2025-05-07T20:32:31.7454687Z T=1, 2025-05-07T20:32:31.7454988Z D=5120, 2025-05-07T20:32:31.7455356Z scale_ub=1200.0, 2025-05-07T20:32:31.7465146Z contiguous=False, 2025-05-07T20:32:31.7465416Z compiled=False, 2025-05-07T20:32:31.7465694Z ) 2025-05-07T20:32:31.7466023Z self = 2025-05-07T20:32:31.7466514Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.7492149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7492493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7492777Z E ^ 2025-05-07T20:32:31.7493305Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7493747Z 2025-05-07T20:32:31.7494162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.7494666Z 2025-05-07T20:32:31.7494784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.7495231Z self=, 2025-05-07T20:32:31.7495628Z T=16384, 2025-05-07T20:32:31.7495828Z D=5120, 2025-05-07T20:32:31.7496028Z scale_ub=1200.0, 2025-05-07T20:32:31.7496257Z contiguous=False, 2025-05-07T20:32:31.7496487Z compiled=True, 2025-05-07T20:32:31.7496688Z ) 2025-05-07T20:32:31.8368242Z self = 2025-05-07T20:32:31.8368866Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.8396298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8396652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8396915Z E ^ 2025-05-07T20:32:31.8397384Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8397827Z 2025-05-07T20:32:31.8398245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
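Since the parameters clearly do not matter, the failure can be reproduced without Hypothesis at all. A standalone sketch for one of the tried examples (T=16384, D=5120, scale_ub=1200.0), using the silu_mul_quant import path visible in the traceback; on a GPU without fp8e4nv support this raises the same CompilationError at kernel launch, with or without torch.compile:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([16384, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)  # CompilationError on SM < 8.9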
2025-05-07T20:32:31.8398801Z 2025-05-07T20:32:31.8398907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8399439Z self=, 2025-05-07T20:32:31.8399840Z T=2048, 2025-05-07T20:32:31.8400027Z D=7168, 2025-05-07T20:32:31.8400227Z scale_ub=1200.0, 2025-05-07T20:32:31.8400457Z contiguous=False, 2025-05-07T20:32:31.8400688Z compiled=True, 2025-05-07T20:32:31.8400911Z ) 2025-05-07T20:32:31.8401236Z self = 2025-05-07T20:32:31.8401781Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.8428606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8428976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8429241Z E ^ 2025-05-07T20:32:31.8429705Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8430159Z 2025-05-07T20:32:31.8430572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.8431079Z 2025-05-07T20:32:31.9594944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9596251Z self=, 2025-05-07T20:32:31.9597344Z T=1, 2025-05-07T20:32:31.9597863Z D=5120, 2025-05-07T20:32:31.9598260Z scale_ub=None, 2025-05-07T20:32:31.9598698Z contiguous=False, 2025-05-07T20:32:31.9599155Z compiled=False, 2025-05-07T20:32:31.9599600Z ) 2025-05-07T20:32:31.9600249Z self = 2025-05-07T20:32:31.9601227Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.9629254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9629612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9629866Z E ^ 2025-05-07T20:32:31.9630373Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9630826Z 2025-05-07T20:32:31.9631238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.9631743Z 2025-05-07T20:32:31.9631857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9632269Z self=, 2025-05-07T20:32:31.9632678Z T=4096, 2025-05-07T20:32:31.9632882Z D=7168, 2025-05-07T20:32:31.9633081Z scale_ub=1200.0, 2025-05-07T20:32:31.9633314Z contiguous=False, 2025-05-07T20:32:31.9633546Z compiled=False, 2025-05-07T20:32:31.9633747Z ) 2025-05-07T20:32:31.9634069Z self = 2025-05-07T20:32:31.9634561Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.9670349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9670705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9670963Z E ^ 2025-05-07T20:32:31.9671423Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9671869Z 2025-05-07T20:32:31.9672290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.9672794Z 2025-05-07T20:32:31.9672906Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9673318Z self=, 2025-05-07T20:32:31.9673726Z T=16384, 2025-05-07T20:32:31.9673926Z D=7168, 2025-05-07T20:32:31.9674118Z scale_ub=None, 2025-05-07T20:32:31.9674343Z contiguous=True, 2025-05-07T20:32:31.9674610Z compiled=True, 2025-05-07T20:32:31.9674825Z ) 2025-05-07T20:32:32.1432023Z self = 2025-05-07T20:32:32.1432779Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.1461229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1461588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.1461854Z E ^ 2025-05-07T20:32:32.1462316Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1462768Z 2025-05-07T20:32:32.1463184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
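The error message also hints at the fallback surface: of the two encodings this GPU does expose, fp8e5 corresponds to torch.float8_e5m2. A hedged sketch of capability-based dtype selection; whether the surrounding kernels accept e5m2 is not established by this log, so this is an assumption, not an FBGEMM recipe:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Prefer e4m3 (Triton's fp8e4nv) where the GPU supports it,
        # otherwise fall back to e5m2 (Triton's fp8e5).
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2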
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1462768Z 2025-05-07T20:32:32.1463184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.1463807Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError as above (identical test source and traceback; only the drawn parameters differ): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
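For reference, silu_mul_quant fuses a SiLU gate, an elementwise multiply, and FP8 quantization. Below is a minimal eager-mode sketch of the math being tested, assuming rowwise dynamic scaling against the float8_e4m3fn max normal value of 448.0; the scaling scheme and the name silu_mul_quant_ref are illustrative assumptions, not FBGEMM's actual implementation.

```python
# Minimal sketch of the op under test (assumptions: rowwise dynamic scaling,
# float8_e4m3fn output with max normal value 448.0; the real FBGEMM Triton
# kernel _fbgemm_silu_mul_quant may scale differently).
from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_max: float = 448.0,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absmax, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / fp8_max
    # The cast itself is plain PyTorch; it is the Triton kernel's fp8e4nv
    # codegen, not this cast, that the log shows failing on this GPU.
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```

The Triton kernel never reaches any of this math here: compilation is rejected as soon as the fp8e4nv output type is encountered.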
2025-05-07T20:32:32.2971408Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.3003187Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.4186856Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.4229099Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:32.4260462Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.6696808Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.6727546Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.7644353Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.9255156Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.9286884Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.9317843Z 2025-05-07T20:32:32.9318251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.9318802Z 2025-05-07T20:32:33.1025812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1026292Z self=, 2025-05-07T20:32:33.1026703Z T=16384, 2025-05-07T20:32:33.1026906Z D=5120, 2025-05-07T20:32:33.1027118Z scale_ub=None, 2025-05-07T20:32:33.1027341Z contiguous=False, 2025-05-07T20:32:33.1027575Z compiled=True, 2025-05-07T20:32:33.1027781Z ) 2025-05-07T20:32:33.1028122Z self = 2025-05-07T20:32:33.1028626Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.1028901Z 2025-05-07T20:32:33.1028989Z @given( 2025-05-07T20:32:33.1029215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1029542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1029867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1030199Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1030539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1030835Z ) 2025-05-07T20:32:33.1031177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1031617Z def test_silu_mul_quant( 2025-05-07T20:32:33.1031872Z self, 2025-05-07T20:32:33.1032067Z T: int, 2025-05-07T20:32:33.1032273Z D: int, 2025-05-07T20:32:33.1032500Z scale_ub: Optional[float], 2025-05-07T20:32:33.1032774Z contiguous: bool, 2025-05-07T20:32:33.1033021Z compiled: bool, 2025-05-07T20:32:33.1033249Z ) -> None: 2025-05-07T20:32:33.1033467Z torch.manual_seed(2025) 2025-05-07T20:32:33.1033709Z 2025-05-07T20:32:33.1033989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1034330Z 2025-05-07T20:32:33.1034524Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1034817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1035131Z x = x_sign * x_clamp 2025-05-07T20:32:33.1035370Z x0 = x[:, :D] 2025-05-07T20:32:33.1035595Z x1 = x[:, D:] 2025-05-07T20:32:33.1035808Z 2025-05-07T20:32:33.1035995Z if contiguous: 2025-05-07T20:32:33.1036238Z x0 = x0.contiguous() 2025-05-07T20:32:33.1036500Z x1 = x1.contiguous() 2025-05-07T20:32:33.1036741Z 2025-05-07T20:32:33.1036943Z if scale_ub is not None: 2025-05-07T20:32:33.1037221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1037549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1037863Z ) 2025-05-07T20:32:33.1038065Z else: 2025-05-07T20:32:33.1038281Z scale_ub_tensor = None 2025-05-07T20:32:33.1038539Z 2025-05-07T20:32:33.1038786Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1039101Z op = silu_mul_quant 2025-05-07T20:32:33.1039346Z if compiled: 2025-05-07T20:32:33.1039601Z op = torch.compile(op) 2025-05-07T20:32:33.1039901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1040171Z 2025-05-07T20:32:33.1040367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1040531Z 2025-05-07T20:32:33.1040638Z moe/activation_test.py:117: 2025-05-07T20:32:33.1041182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1041662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1041949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1042509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.1043059Z return fn(*args, **kwargs) 
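The failure is an architecture capability gap rather than a bad input: fp8e4nv is Triton's name for the FP8 E4M3 format the kernel writes, and Triton generally exposes fp8e4nv conversions only on GPUs of compute capability 8.9 or newer; on older parts it offers just fp8e4b15 and fp8e5, exactly the two dtypes the error lists. A minimal sketch of a capability guard that would let such tests skip cleanly on this hardware follows; the cuda_supports_fp8e4nv helper and the (8, 9) threshold are illustrative assumptions, not FBGEMM's actual gating.

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (FP8 E4M3) needs SM 8.9+ (Ada/Hopper);
        # earlier architectures only get fp8e4b15/fp8e5, matching this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class Fp8GuardSketch(unittest.TestCase):
        # Hypothetical guard, applied the way a test like test_silu_mul_quant could be.
        @unittest.skipIf(
            not cuda_supports_fp8e4nv(),
            "fp8e4nv not supported on this architecture; skipping FP8 E4M3 test",
        )
        def test_requires_fp8e4nv(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))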
Every subsequent Hypothesis example failed the same way: Triton raised CompilationError at 1:0 of def _fbgemm_silu_mul_quant( with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from triton/compiler/compiler.py:100, through the identical stack shown above (for the compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent, since torch.compile is skipped). The examples tried:
2025-05-07T20:32:33.1025812Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.1058284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.2003971Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.3784882Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:33.3817294Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.6648471Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:33.7908402Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:33.7940279Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:33.9719831Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.9753727Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:34.0706415Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.0736942Z 2025-05-07T20:32:34.0737423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.0737925Z 2025-05-07T20:32:34.1333681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1334131Z self=, 2025-05-07T20:32:34.1334566Z T=16384, 2025-05-07T20:32:34.1334824Z D=5120, 2025-05-07T20:32:34.1335093Z scale_ub=None, 2025-05-07T20:32:34.1335339Z contiguous=False, 2025-05-07T20:32:34.1335558Z compiled=False, 2025-05-07T20:32:34.1335891Z ) 2025-05-07T20:32:34.1336209Z self = 2025-05-07T20:32:34.1336693Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.1336975Z 2025-05-07T20:32:34.1337056Z @given( 2025-05-07T20:32:34.1337287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1337603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1337902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1338246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1338571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1338850Z ) 2025-05-07T20:32:34.1339205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1339647Z def test_silu_mul_quant( 2025-05-07T20:32:34.1339881Z self, 2025-05-07T20:32:34.1340073Z T: int, 2025-05-07T20:32:34.1340275Z D: int, 2025-05-07T20:32:34.1340488Z scale_ub: Optional[float], 2025-05-07T20:32:34.1340753Z contiguous: bool, 2025-05-07T20:32:34.1340993Z compiled: bool, 2025-05-07T20:32:34.1341215Z ) -> None: 2025-05-07T20:32:34.1341430Z torch.manual_seed(2025) 2025-05-07T20:32:34.1341672Z 2025-05-07T20:32:34.1341945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1342285Z 2025-05-07T20:32:34.1342483Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1342774Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1344760Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1346605Z 2025-05-07T20:32:34.1346728Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1346935Z 2025-05-07T20:32:34.1347036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1347447Z self=, 2025-05-07T20:32:34.1347839Z T=4096, 2025-05-07T20:32:34.1348021Z D=7168, 2025-05-07T20:32:34.1348210Z scale_ub=1200.0, 2025-05-07T20:32:34.1348427Z contiguous=True, 2025-05-07T20:32:34.1348668Z compiled=True, 2025-05-07T20:32:34.1348862Z ) 2025-05-07T20:32:34.1349177Z self = 2025-05-07T20:32:34.1349661Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.1349928Z 2025-05-07T20:32:34.1350014Z @given( 2025-05-07T20:32:34.1350237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1350547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1350857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1351180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1351502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1351850Z ) 2025-05-07T20:32:34.1352185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1352727Z def test_silu_mul_quant( 2025-05-07T20:32:34.1352972Z self, 2025-05-07T20:32:34.1353167Z T: int, 2025-05-07T20:32:34.1353362Z D: int, 2025-05-07T20:32:34.1353587Z scale_ub: Optional[float], 2025-05-07T20:32:34.1353850Z contiguous: bool, 2025-05-07T20:32:34.1354086Z compiled: bool, 2025-05-07T20:32:34.1354312Z ) -> None: 2025-05-07T20:32:34.1354527Z torch.manual_seed(2025) 2025-05-07T20:32:34.1354809Z 2025-05-07T20:32:34.1355075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1355457Z 2025-05-07T20:32:34.1355644Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1355930Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1357907Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1359992Z 2025-05-07T20:32:34.1360119Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1360330Z 2025-05-07T20:32:34.1360438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1360836Z self=, 2025-05-07T20:32:34.1361229Z T=16384, 2025-05-07T20:32:34.1361423Z D=7168, 2025-05-07T20:32:34.1361607Z scale_ub=None, 2025-05-07T20:32:34.1361817Z contiguous=False, 2025-05-07T20:32:34.1362040Z compiled=False, 2025-05-07T20:32:34.1362240Z ) 2025-05-07T20:32:34.1362549Z self = 2025-05-07T20:32:34.1363044Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.1363316Z 2025-05-07T20:32:34.1363394Z @given( 2025-05-07T20:32:34.1363629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1363934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1364235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1364555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1364878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1365156Z ) 2025-05-07T20:32:34.1365546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1365988Z def test_silu_mul_quant( 2025-05-07T20:32:34.1366224Z self, 2025-05-07T20:32:34.1366426Z T: int, 2025-05-07T20:32:34.1366619Z D: int, 2025-05-07T20:32:34.1366832Z scale_ub: Optional[float], 2025-05-07T20:32:34.1367106Z contiguous: bool, 2025-05-07T20:32:34.1367354Z compiled: bool, 2025-05-07T20:32:34.1367569Z ) -> None: 2025-05-07T20:32:34.1367779Z torch.manual_seed(2025) 2025-05-07T20:32:34.1368014Z 2025-05-07T20:32:34.1368279Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1370297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1372219Z 2025-05-07T20:32:34.1372335Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.1372679Z 2025-05-07T20:32:34.1372786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1373276Z self=, 2025-05-07T20:32:34.1373671Z T=2048, 2025-05-07T20:32:34.1373857Z D=7168, 2025-05-07T20:32:34.1374045Z scale_ub=1200.0, 2025-05-07T20:32:34.1374259Z contiguous=True, 2025-05-07T20:32:34.1374481Z compiled=True, 2025-05-07T20:32:34.1374750Z ) 2025-05-07T20:32:34.1375089Z self = 2025-05-07T20:32:34.1375596Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.1375867Z 2025-05-07T20:32:34.1375941Z @given( 2025-05-07T20:32:34.1376167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1376467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1376775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1377100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1377417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1377697Z ) 2025-05-07T20:32:34.1378043Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1378470Z def test_silu_mul_quant( 2025-05-07T20:32:34.1378707Z self, 2025-05-07T20:32:34.1378895Z T: int, 2025-05-07T20:32:34.1379096Z D: int, 2025-05-07T20:32:34.1379309Z scale_ub: Optional[float], 2025-05-07T20:32:34.1379575Z contiguous: bool, 2025-05-07T20:32:34.1379810Z compiled: bool, 2025-05-07T20:32:34.1380024Z ) -> None: 2025-05-07T20:32:34.1380239Z torch.manual_seed(2025) 2025-05-07T20:32:34.1380477Z 2025-05-07T20:32:34.1380736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1381070Z 2025-05-07T20:32:34.1381263Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1381546Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1383512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1385331Z 2025-05-07T20:32:34.1385446Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1385656Z 2025-05-07T20:32:34.1385758Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1386160Z self=, 2025-05-07T20:32:34.1386549Z T=2048, 2025-05-07T20:32:34.1386740Z D=7168, 2025-05-07T20:32:34.1386929Z scale_ub=None, 2025-05-07T20:32:34.1387134Z contiguous=True, 2025-05-07T20:32:34.1387349Z compiled=False, 2025-05-07T20:32:34.1387551Z ) 2025-05-07T20:32:34.2520637Z self = 2025-05-07T20:32:34.2522016Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.2522585Z 2025-05-07T20:32:34.2522737Z @given( 2025-05-07T20:32:34.2523198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2523808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2524394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2525032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2525413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2525826Z ) 2025-05-07T20:32:34.2526167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2526712Z def test_silu_mul_quant( 2025-05-07T20:32:34.2526945Z self, 2025-05-07T20:32:34.2527141Z T: int, 2025-05-07T20:32:34.2527350Z D: int, 2025-05-07T20:32:34.2527558Z scale_ub: Optional[float], 2025-05-07T20:32:34.2527838Z contiguous: bool, 2025-05-07T20:32:34.2528083Z compiled: bool, 2025-05-07T20:32:34.2528300Z ) -> None: 2025-05-07T20:32:34.2528515Z torch.manual_seed(2025) 2025-05-07T20:32:34.2528826Z 2025-05-07T20:32:34.2529094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2529422Z 2025-05-07T20:32:34.2529616Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.2531563Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.2533460Z 2025-05-07T20:32:34.2533578Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.2533786Z 2025-05-07T20:32:34.2533885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2534293Z self=, 2025-05-07T20:32:34.2534682Z T=1, 2025-05-07T20:32:34.2534862Z D=7168, 2025-05-07T20:32:34.2535050Z scale_ub=1200.0, 2025-05-07T20:32:34.2535266Z contiguous=True, 2025-05-07T20:32:34.2535490Z compiled=False, 2025-05-07T20:32:34.2535686Z ) 2025-05-07T20:32:34.2535997Z self = 2025-05-07T20:32:34.2536473Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.2536735Z 2025-05-07T20:32:34.2536814Z @given( 2025-05-07T20:32:34.2537037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2537340Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2537638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2537965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2538288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2538566Z ) 2025-05-07T20:32:34.2538910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2539344Z def test_silu_mul_quant( 2025-05-07T20:32:34.2539589Z self, 2025-05-07T20:32:34.2539776Z T: int, 2025-05-07T20:32:34.2539970Z D: int, 2025-05-07T20:32:34.2540184Z scale_ub: Optional[float], 2025-05-07T20:32:34.2540447Z contiguous: bool, 2025-05-07T20:32:34.2540691Z compiled: bool, 2025-05-07T20:32:34.2540916Z ) -> None: 2025-05-07T20:32:34.2541127Z torch.manual_seed(2025) 2025-05-07T20:32:34.2541369Z 2025-05-07T20:32:34.2541635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2541963Z 2025-05-07T20:32:34.2542157Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2542446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2542751Z x = x_sign * x_clamp 2025-05-07T20:32:34.2542988Z x0 = x[:, :D] 2025-05-07T20:32:34.2543206Z x1 = x[:, D:] 2025-05-07T20:32:34.2543410Z 2025-05-07T20:32:34.2543590Z if contiguous: 2025-05-07T20:32:34.2543818Z x0 = x0.contiguous() 2025-05-07T20:32:34.2544072Z x1 = x1.contiguous() 2025-05-07T20:32:34.2544303Z 2025-05-07T20:32:34.2544495Z if scale_ub is not None: 2025-05-07T20:32:34.2544770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2545180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2545495Z ) 2025-05-07T20:32:34.2545759Z else: 2025-05-07T20:32:34.2545967Z scale_ub_tensor = None 2025-05-07T20:32:34.2546224Z 2025-05-07T20:32:34.2546459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2546767Z op = silu_mul_quant 2025-05-07T20:32:34.2547011Z if compiled: 2025-05-07T20:32:34.2547258Z op = torch.compile(op) 2025-05-07T20:32:34.2547544Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2547865Z 2025-05-07T20:32:34.2548069Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2548235Z 2025-05-07T20:32:34.2548336Z moe/activation_test.py:117: 2025-05-07T20:32:34.2548624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2548945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2549225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2549906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.2550583Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2551108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2551772Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2552425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2552952Z kernel = self.compile( 2025-05-07T20:32:34.2553485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2554122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2554517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2554754Z 2025-05-07T20:32:34.2554965Z self = 2025-05-07T20:32:34.2556077Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2557421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016cd2b60>} 2025-05-07T20:32:34.2558743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2559936Z context = 2025-05-07T20:32:34.2560222Z 2025-05-07T20:32:34.2560404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2560915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2561376Z module_map=module_map) 2025-05-07T20:32:34.2561744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2562100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2562349Z E ^ 2025-05-07T20:32:34.2562807Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2563258Z 2025-05-07T20:32:34.2563666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2564166Z 2025-05-07T20:32:34.2564271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2564670Z self=, 2025-05-07T20:32:34.2565184Z T=128, 2025-05-07T20:32:34.2565372Z D=5120, 2025-05-07T20:32:34.2565580Z scale_ub=None, 2025-05-07T20:32:34.2565930Z contiguous=True, 2025-05-07T20:32:34.2566151Z compiled=False, 2025-05-07T20:32:34.2566349Z ) 2025-05-07T20:32:34.3243524Z self = 2025-05-07T20:32:34.3244065Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.3244336Z 2025-05-07T20:32:34.3244416Z @given( 2025-05-07T20:32:34.3244644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.3245068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.3245371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.3245699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.3246020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.3246298Z ) 2025-05-07T20:32:34.3246642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.3247082Z def test_silu_mul_quant( 2025-05-07T20:32:34.3247321Z self, 2025-05-07T20:32:34.3247513Z T: int, 2025-05-07T20:32:34.3247703Z D: int, 2025-05-07T20:32:34.3247914Z scale_ub: Optional[float], 2025-05-07T20:32:34.3248188Z contiguous: bool, 2025-05-07T20:32:34.3248424Z compiled: bool, 2025-05-07T20:32:34.3248646Z ) -> None: 2025-05-07T20:32:34.3248859Z torch.manual_seed(2025) 2025-05-07T20:32:34.3249100Z 2025-05-07T20:32:34.3249370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.3249702Z 2025-05-07T20:32:34.3249892Z x_sign = torch.sign(x) 2025-05-07T20:32:34.3250174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.3250470Z x = x_sign * x_clamp 2025-05-07T20:32:34.3250707Z x0 = x[:, :D] 2025-05-07T20:32:34.3250923Z x1 = x[:, D:] 2025-05-07T20:32:34.3251123Z 2025-05-07T20:32:34.3251314Z if contiguous: 2025-05-07T20:32:34.3251546Z x0 = x0.contiguous() 2025-05-07T20:32:34.3251801Z x1 = x1.contiguous() 2025-05-07T20:32:34.3252042Z 2025-05-07T20:32:34.3252227Z if scale_ub is not None: 2025-05-07T20:32:34.3252490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.3252817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.3253195Z ) 2025-05-07T20:32:34.3261533Z else: 2025-05-07T20:32:34.3261784Z scale_ub_tensor = None 2025-05-07T20:32:34.3262047Z 2025-05-07T20:32:34.3262285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.3262600Z op = silu_mul_quant 2025-05-07T20:32:34.3262843Z if compiled: 2025-05-07T20:32:34.3263091Z op = torch.compile(op) 2025-05-07T20:32:34.3263385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3263645Z 2025-05-07T20:32:34.3263832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.3263992Z 2025-05-07T20:32:34.3264096Z moe/activation_test.py:117: 2025-05-07T20:32:34.3264384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3264708Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.3264985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3265666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.3266335Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.3266863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.3267526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.3268178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.3268810Z kernel = self.compile( 2025-05-07T20:32:34.3269461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.3270110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.3270499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3270727Z 2025-05-07T20:32:34.3270935Z self = 2025-05-07T20:32:34.3271995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.3273399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016cd3c40>} 2025-05-07T20:32:34.3274724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.3275767Z context = 2025-05-07T20:32:34.3276049Z 2025-05-07T20:32:34.3276209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.3276712Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.3277173Z module_map=module_map) 2025-05-07T20:32:34.3277526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.3277869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.3278123Z E ^ 2025-05-07T20:32:34.3278572Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3279019Z 2025-05-07T20:32:34.3279428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3279942Z 2025-05-07T20:32:34.3280043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3280449Z self=, 2025-05-07T20:32:34.3280840Z T=128, 2025-05-07T20:32:34.3281025Z D=7168, 2025-05-07T20:32:34.3281213Z scale_ub=None, 2025-05-07T20:32:34.3281413Z contiguous=True, 2025-05-07T20:32:34.3281630Z compiled=False, 2025-05-07T20:32:34.3281835Z ) 2025-05-07T20:32:34.3282146Z self = 2025-05-07T20:32:34.3282623Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.3282878Z 2025-05-07T20:32:34.3282959Z @given( 2025-05-07T20:32:34.3283178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.3283482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.3283779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.3284103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.3284422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.3284700Z ) 2025-05-07T20:32:34.3285041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.3285471Z def test_silu_mul_quant( 2025-05-07T20:32:34.3285712Z self, 2025-05-07T20:32:34.3285902Z T: int, 2025-05-07T20:32:34.3286088Z D: int, 2025-05-07T20:32:34.3286303Z scale_ub: Optional[float], 2025-05-07T20:32:34.3286572Z contiguous: bool, 2025-05-07T20:32:34.3286804Z compiled: bool, 2025-05-07T20:32:34.3287019Z ) -> None: 2025-05-07T20:32:34.3287226Z torch.manual_seed(2025) 2025-05-07T20:32:34.3287460Z 2025-05-07T20:32:34.3287723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.3288107Z 2025-05-07T20:32:34.3288298Z x_sign = torch.sign(x) 2025-05-07T20:32:34.3288652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.3288960Z x = x_sign * x_clamp 2025-05-07T20:32:34.3289194Z x0 = x[:, :D] 2025-05-07T20:32:34.3289402Z x1 = x[:, D:] 2025-05-07T20:32:34.3289609Z 2025-05-07T20:32:34.3289798Z if contiguous: 2025-05-07T20:32:34.3290024Z x0 = x0.contiguous() 2025-05-07T20:32:34.3290283Z x1 = x1.contiguous() 2025-05-07T20:32:34.3290519Z 2025-05-07T20:32:34.3290803Z if scale_ub is not None: 2025-05-07T20:32:34.3291072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.3291401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.3291701Z ) 2025-05-07T20:32:34.3291898Z else: 2025-05-07T20:32:34.3292114Z scale_ub_tensor = None 2025-05-07T20:32:34.3292365Z 2025-05-07T20:32:34.3292602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.3292909Z op = silu_mul_quant 2025-05-07T20:32:34.3293210Z if compiled: 2025-05-07T20:32:34.3293457Z op = torch.compile(op) 2025-05-07T20:32:34.3293748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3294016Z 2025-05-07T20:32:34.3294202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.3294366Z 2025-05-07T20:32:34.3294464Z moe/activation_test.py:117: 2025-05-07T20:32:34.3294757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3295084Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.3295360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3296036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.3296719Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.3297247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.3297922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.3298577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.3299092Z kernel = self.compile( 2025-05-07T20:32:34.3299622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.3300261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.3300656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3300878Z 2025-05-07T20:32:34.3301083Z self = 2025-05-07T20:32:34.3302140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.3303487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016b70ae0>} 2025-05-07T20:32:34.3304798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.3305810Z context = 2025-05-07T20:32:34.3306095Z 2025-05-07T20:32:34.3306262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.3306775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.3307238Z module_map=module_map) 2025-05-07T20:32:34.3307647Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.3307997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.3308255Z E ^ 2025-05-07T20:32:34.3308786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3309226Z 2025-05-07T20:32:34.3309637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3310139Z 2025-05-07T20:32:34.3310242Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3310691Z self=, 2025-05-07T20:32:34.3311089Z T=2048, 2025-05-07T20:32:34.3311271Z D=7168, 2025-05-07T20:32:34.3311463Z scale_ub=1200.0, 2025-05-07T20:32:34.3311691Z contiguous=True, 2025-05-07T20:32:34.3311908Z compiled=False, 2025-05-07T20:32:34.3312116Z ) 2025-05-07T20:32:34.4119917Z self = 2025-05-07T20:32:34.4120981Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.4121525Z 2025-05-07T20:32:34.4121682Z @given( 2025-05-07T20:32:34.4122127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4122746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4123341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4123995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4124633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4125167Z ) 2025-05-07T20:32:34.4125507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4125940Z def test_silu_mul_quant( 2025-05-07T20:32:34.4126181Z self, 2025-05-07T20:32:34.4126376Z T: int, 2025-05-07T20:32:34.4126566Z D: int, 2025-05-07T20:32:34.4126782Z scale_ub: Optional[float], 2025-05-07T20:32:34.4127051Z contiguous: bool, 2025-05-07T20:32:34.4127284Z compiled: bool, 2025-05-07T20:32:34.4127507Z ) -> None: 2025-05-07T20:32:34.4127724Z torch.manual_seed(2025) 2025-05-07T20:32:34.4127956Z 2025-05-07T20:32:34.4128227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4130254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4132073Z 2025-05-07T20:32:34.4132194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4132402Z 2025-05-07T20:32:34.4132510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4132917Z self=, 2025-05-07T20:32:34.4133365Z T=1, 2025-05-07T20:32:34.4133544Z D=5120, 2025-05-07T20:32:34.4133724Z scale_ub=1200.0, 2025-05-07T20:32:34.4133939Z contiguous=True, 2025-05-07T20:32:34.4134167Z compiled=False, 2025-05-07T20:32:34.4134364Z ) 2025-05-07T20:32:34.4134675Z self = 2025-05-07T20:32:34.4135162Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.4135418Z 2025-05-07T20:32:34.4135494Z @given( 2025-05-07T20:32:34.4135721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4136024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4136332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4136763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4137085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4137498Z ) 2025-05-07T20:32:34.4137839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4138277Z def test_silu_mul_quant( 2025-05-07T20:32:34.4138514Z self, 2025-05-07T20:32:34.4138709Z T: int, 2025-05-07T20:32:34.4138906Z D: int, 2025-05-07T20:32:34.4139116Z scale_ub: Optional[float], 2025-05-07T20:32:34.4139379Z contiguous: bool, 2025-05-07T20:32:34.4139687Z compiled: bool, 2025-05-07T20:32:34.4139906Z ) -> None: 2025-05-07T20:32:34.4140112Z torch.manual_seed(2025) 2025-05-07T20:32:34.4140347Z 2025-05-07T20:32:34.4140614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4140942Z 2025-05-07T20:32:34.4141135Z x_sign = torch.sign(x) 2025-05-07T20:32:34.4141424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.4141729Z x = x_sign * x_clamp 2025-05-07T20:32:34.4141969Z x0 = x[:, :D] 2025-05-07T20:32:34.4142191Z x1 = x[:, D:] 2025-05-07T20:32:34.4142397Z 2025-05-07T20:32:34.4142589Z if contiguous: 2025-05-07T20:32:34.4142822Z x0 = x0.contiguous() 2025-05-07T20:32:34.4143076Z x1 = x1.contiguous() 2025-05-07T20:32:34.4143310Z 2025-05-07T20:32:34.4143504Z if scale_ub is not None: 2025-05-07T20:32:34.4143777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.4144108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.4144418Z ) 2025-05-07T20:32:34.4144606Z else: 2025-05-07T20:32:34.4144812Z scale_ub_tensor = None 2025-05-07T20:32:34.4145070Z 2025-05-07T20:32:34.4145297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.4145606Z op = silu_mul_quant 2025-05-07T20:32:34.4145855Z if compiled: 2025-05-07T20:32:34.4146107Z op = torch.compile(op) 2025-05-07T20:32:34.4146396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4146662Z 2025-05-07T20:32:34.4146854Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.4147013Z 2025-05-07T20:32:34.4147110Z moe/activation_test.py:117: 2025-05-07T20:32:34.4147401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4147724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.4148001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4148679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.4149356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.4149880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.4150541Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.4151199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.4151721Z kernel = self.compile( 2025-05-07T20:32:34.4152254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.4152887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.4153275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4153507Z 2025-05-07T20:32:34.4153713Z self = 2025-05-07T20:32:34.4154776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.4156238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016b720c0>} 2025-05-07T20:32:34.4157560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.4158563Z context = 2025-05-07T20:32:34.4158844Z 2025-05-07T20:32:34.4159011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.4159722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.4160181Z module_map=module_map) 2025-05-07T20:32:34.4160554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.4160904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.4161166Z E ^ 2025-05-07T20:32:34.4161633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.4162077Z 2025-05-07T20:32:34.4162490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.4162987Z 2025-05-07T20:32:34.4163094Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4163498Z self=, 2025-05-07T20:32:34.4163892Z T=2048, 2025-05-07T20:32:34.4164083Z D=5120, 2025-05-07T20:32:34.4164267Z scale_ub=None, 2025-05-07T20:32:34.4164478Z contiguous=True, 2025-05-07T20:32:34.4164697Z compiled=False, 2025-05-07T20:32:34.4164892Z ) 2025-05-07T20:32:34.4165210Z self = 2025-05-07T20:32:34.4165691Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4165959Z 2025-05-07T20:32:34.4166037Z @given( 2025-05-07T20:32:34.4166272Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4166579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4166883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4167203Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4167533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4167810Z ) 2025-05-07T20:32:34.4168150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4168581Z def test_silu_mul_quant( 2025-05-07T20:32:34.4168816Z self, 2025-05-07T20:32:34.4168999Z T: int, 2025-05-07T20:32:34.4169193Z D: int, 2025-05-07T20:32:34.4169403Z scale_ub: Optional[float], 2025-05-07T20:32:34.4169663Z contiguous: bool, 2025-05-07T20:32:34.4169899Z compiled: bool, 2025-05-07T20:32:34.4170118Z ) -> None: 2025-05-07T20:32:34.4170324Z torch.manual_seed(2025) 2025-05-07T20:32:34.4170557Z 2025-05-07T20:32:34.4170824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4171158Z 2025-05-07T20:32:34.4171348Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.4173323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4175157Z 2025-05-07T20:32:34.4175271Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.4175555Z 2025-05-07T20:32:34.4175663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4176176Z self=, 2025-05-07T20:32:34.4176576Z T=16384, 2025-05-07T20:32:34.4176774Z D=5120, 2025-05-07T20:32:34.4176971Z scale_ub=None, 2025-05-07T20:32:34.4177186Z contiguous=True, 2025-05-07T20:32:34.4177413Z compiled=False, 2025-05-07T20:32:34.4177622Z ) 2025-05-07T20:32:34.4936770Z self = 2025-05-07T20:32:34.4937330Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4937735Z 2025-05-07T20:32:34.4937812Z @given( 2025-05-07T20:32:34.4938038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4938348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4938644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4938971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4939304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4939577Z ) 2025-05-07T20:32:34.4939929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4940361Z def test_silu_mul_quant( 2025-05-07T20:32:34.4940602Z self, 2025-05-07T20:32:34.4940789Z T: int, 2025-05-07T20:32:34.4940985Z D: int, 2025-05-07T20:32:34.4941199Z scale_ub: Optional[float], 2025-05-07T20:32:34.4941465Z contiguous: bool, 2025-05-07T20:32:34.4941704Z compiled: bool, 2025-05-07T20:32:34.4941929Z ) -> None: 2025-05-07T20:32:34.4942141Z torch.manual_seed(2025) 2025-05-07T20:32:34.4942378Z 2025-05-07T20:32:34.4942645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4944665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4946554Z 2025-05-07T20:32:34.4946673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4946885Z 2025-05-07T20:32:34.4946987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4947396Z self=, 2025-05-07T20:32:34.4947790Z T=4096, 2025-05-07T20:32:34.4947969Z D=5120, 2025-05-07T20:32:34.4948159Z scale_ub=None, 2025-05-07T20:32:34.4948364Z contiguous=True, 2025-05-07T20:32:34.4948578Z compiled=False, 2025-05-07T20:32:34.4948785Z ) 2025-05-07T20:32:34.4949100Z self = 2025-05-07T20:32:34.4949583Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4949850Z 2025-05-07T20:32:34.4949926Z @given( 2025-05-07T20:32:34.4950151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4950458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4950751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4951071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4951394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4951673Z ) 2025-05-07T20:32:34.4952010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4952444Z def test_silu_mul_quant( 2025-05-07T20:32:34.4952685Z self, 2025-05-07T20:32:34.4952875Z T: int, 2025-05-07T20:32:34.4953068Z D: int, 2025-05-07T20:32:34.4953347Z scale_ub: Optional[float], 2025-05-07T20:32:34.4953616Z contiguous: bool, 2025-05-07T20:32:34.4953853Z compiled: bool, 2025-05-07T20:32:34.4954178Z ) -> None: 2025-05-07T20:32:34.4954398Z torch.manual_seed(2025) 2025-05-07T20:32:34.4954637Z 2025-05-07T20:32:34.4954902Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4956954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4958816Z 2025-05-07T20:32:34.4958939Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4959149Z 2025-05-07T20:32:34.4959419Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4959831Z self=, 2025-05-07T20:32:34.4960223Z T=2048, 2025-05-07T20:32:34.4960407Z D=5120, 2025-05-07T20:32:34.4960591Z scale_ub=None, 2025-05-07T20:32:34.4960796Z contiguous=False, 2025-05-07T20:32:34.4961020Z compiled=False, 2025-05-07T20:32:34.4961220Z ) 2025-05-07T20:32:34.4961532Z self = 2025-05-07T20:32:34.4962019Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.4962286Z 2025-05-07T20:32:34.4962363Z @given( 2025-05-07T20:32:34.4962598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4962908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4963219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4963552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4963875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4964154Z ) 2025-05-07T20:32:34.4964496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4964924Z def test_silu_mul_quant( 2025-05-07T20:32:34.4965159Z self, 2025-05-07T20:32:34.4965357Z T: int, 2025-05-07T20:32:34.4965583Z D: int, 2025-05-07T20:32:34.4965820Z scale_ub: Optional[float], 2025-05-07T20:32:34.4966092Z contiguous: bool, 2025-05-07T20:32:34.4966325Z compiled: bool, 2025-05-07T20:32:34.4966537Z ) -> None: 2025-05-07T20:32:34.4966745Z torch.manual_seed(2025) 2025-05-07T20:32:34.4966985Z 2025-05-07T20:32:34.4967244Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4969247Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4971062Z 2025-05-07T20:32:34.4971183Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4971401Z 2025-05-07T20:32:34.4971503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4971910Z self=, 2025-05-07T20:32:34.4972297Z T=4096, 2025-05-07T20:32:34.4972486Z D=7168, 2025-05-07T20:32:34.4972674Z scale_ub=None, 2025-05-07T20:32:34.4972877Z contiguous=True, 2025-05-07T20:32:34.4973246Z compiled=True, 2025-05-07T20:32:34.4973449Z ) 2025-05-07T20:32:34.4973755Z self = 2025-05-07T20:32:34.4974370Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.4974636Z 2025-05-07T20:32:34.4974719Z @given( 2025-05-07T20:32:34.4974949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4975256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4975559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4975879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4976261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4976540Z ) 2025-05-07T20:32:34.4976883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4977318Z def test_silu_mul_quant( 2025-05-07T20:32:34.4977567Z self, 2025-05-07T20:32:34.4977765Z T: int, 2025-05-07T20:32:34.4986532Z D: int, 2025-05-07T20:32:34.4986792Z scale_ub: Optional[float], 2025-05-07T20:32:34.4987070Z contiguous: bool, 2025-05-07T20:32:34.4987313Z compiled: bool, 2025-05-07T20:32:34.4987527Z ) -> None: 2025-05-07T20:32:34.4987739Z torch.manual_seed(2025) 2025-05-07T20:32:34.4987973Z 2025-05-07T20:32:34.4988247Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4990265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4992096Z 2025-05-07T20:32:34.4992218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4992430Z 2025-05-07T20:32:34.4992535Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4992938Z self=, 2025-05-07T20:32:34.4993321Z T=2048, 2025-05-07T20:32:34.4993503Z D=5120, 2025-05-07T20:32:34.4993686Z scale_ub=1200.0, 2025-05-07T20:32:34.4993907Z contiguous=False, 2025-05-07T20:32:34.4994132Z compiled=False, 2025-05-07T20:32:34.4994350Z ) 2025-05-07T20:32:34.4994660Z self = 2025-05-07T20:32:34.4995144Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.4995415Z 2025-05-07T20:32:34.4995492Z @given( 2025-05-07T20:32:34.4995718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4996015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4996322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4996645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4996959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4997235Z ) 2025-05-07T20:32:34.4997571Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4998002Z def test_silu_mul_quant( 2025-05-07T20:32:34.4998236Z self, 2025-05-07T20:32:34.4998421Z T: int, 2025-05-07T20:32:34.4998618Z D: int, 2025-05-07T20:32:34.4998831Z scale_ub: Optional[float], 2025-05-07T20:32:34.4999093Z contiguous: bool, 2025-05-07T20:32:34.4999326Z compiled: bool, 2025-05-07T20:32:34.4999537Z ) -> None: 2025-05-07T20:32:34.4999745Z torch.manual_seed(2025) 2025-05-07T20:32:34.4999978Z 2025-05-07T20:32:34.5000234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5002393Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.5004249Z 2025-05-07T20:32:34.5004363Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.5004572Z 2025-05-07T20:32:34.5004672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5005070Z self=, 2025-05-07T20:32:34.5005459Z T=4096, 2025-05-07T20:32:34.5005646Z D=7168, 2025-05-07T20:32:34.5005830Z scale_ub=1200.0, 2025-05-07T20:32:34.5006045Z contiguous=True, 2025-05-07T20:32:34.5006255Z compiled=False, 2025-05-07T20:32:34.5006458Z ) 2025-05-07T20:32:34.6062273Z self = 2025-05-07T20:32:34.6062813Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.6063085Z 2025-05-07T20:32:34.6063164Z @given( 2025-05-07T20:32:34.6063403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6063718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6064023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6064363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6064698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6064987Z ) 2025-05-07T20:32:34.6065362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6065829Z def test_silu_mul_quant( 2025-05-07T20:32:34.6066077Z self, 2025-05-07T20:32:34.6066268Z T: int, 2025-05-07T20:32:34.6066466Z D: int, 2025-05-07T20:32:34.6066694Z scale_ub: Optional[float], 2025-05-07T20:32:34.6066957Z contiguous: bool, 2025-05-07T20:32:34.6067198Z compiled: bool, 2025-05-07T20:32:34.6067420Z ) -> None: 2025-05-07T20:32:34.6067630Z torch.manual_seed(2025) 2025-05-07T20:32:34.6067870Z 2025-05-07T20:32:34.6068145Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6070164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6071999Z 2025-05-07T20:32:34.6072127Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6072340Z 2025-05-07T20:32:34.6072443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6072855Z self=, 2025-05-07T20:32:34.6073254Z T=16384, 2025-05-07T20:32:34.6073444Z D=7168, 2025-05-07T20:32:34.6073633Z scale_ub=None, 2025-05-07T20:32:34.6073842Z contiguous=False, 2025-05-07T20:32:34.6074071Z compiled=True, 2025-05-07T20:32:34.6074269Z ) 2025-05-07T20:32:34.6074581Z self = 2025-05-07T20:32:34.6075065Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6075350Z 2025-05-07T20:32:34.6075433Z @given( 2025-05-07T20:32:34.6075704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6076112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6076525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6076856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6077191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6077470Z ) 2025-05-07T20:32:34.6077822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6078262Z def test_silu_mul_quant( 2025-05-07T20:32:34.6078504Z self, 2025-05-07T20:32:34.6078771Z T: int, 2025-05-07T20:32:34.6078970Z D: int, 2025-05-07T20:32:34.6079188Z scale_ub: Optional[float], 2025-05-07T20:32:34.6079453Z contiguous: bool, 2025-05-07T20:32:34.6079690Z compiled: bool, 2025-05-07T20:32:34.6079914Z ) -> None: 2025-05-07T20:32:34.6080124Z torch.manual_seed(2025) 2025-05-07T20:32:34.6080370Z 2025-05-07T20:32:34.6080644Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6082655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
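Editor's note: the "Tried to allocate" figures match the test's first allocation exactly, a [T, 2*D] bfloat16 tensor at 2 bytes per element. A quick arithmetic check against the sizes seen so far:

```python
# Sanity check: the requested sizes are exactly the footprint of
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) at 2 bytes per element.
for T, D in [(2048, 5120), (4096, 7168), (16384, 7168)]:
    mib = T * (2 * D) * 2 / 2**20
    print(T, D, f"{mib:.2f} MiB")
# 2048 5120 40.00 MiB
# 4096 7168 112.00 MiB
# 16384 7168 448.00 MiB
```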
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6084481Z 2025-05-07T20:32:34.6084598Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6084811Z 2025-05-07T20:32:34.6084914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6085322Z self=, 2025-05-07T20:32:34.6085717Z T=4096, 2025-05-07T20:32:34.6085904Z D=7168, 2025-05-07T20:32:34.6086099Z scale_ub=None, 2025-05-07T20:32:34.6086317Z contiguous=True, 2025-05-07T20:32:34.6086541Z compiled=False, 2025-05-07T20:32:34.6086748Z ) 2025-05-07T20:32:34.6087066Z self = 2025-05-07T20:32:34.6087546Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.6087816Z 2025-05-07T20:32:34.6087892Z @given( 2025-05-07T20:32:34.6088120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6088428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6088730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6089053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6089375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6089651Z ) 2025-05-07T20:32:34.6089993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6090430Z def test_silu_mul_quant( 2025-05-07T20:32:34.6090673Z self, 2025-05-07T20:32:34.6090866Z T: int, 2025-05-07T20:32:34.6091065Z D: int, 2025-05-07T20:32:34.6091279Z scale_ub: Optional[float], 2025-05-07T20:32:34.6091547Z contiguous: bool, 2025-05-07T20:32:34.6091788Z compiled: bool, 2025-05-07T20:32:34.6092003Z ) -> None: 2025-05-07T20:32:34.6092214Z torch.manual_seed(2025) 2025-05-07T20:32:34.6092456Z 2025-05-07T20:32:34.6092720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6094878Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6096750Z 2025-05-07T20:32:34.6096867Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6097084Z 2025-05-07T20:32:34.6097186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6097591Z self=, 2025-05-07T20:32:34.6097983Z T=16384, 2025-05-07T20:32:34.6098175Z D=7168, 2025-05-07T20:32:34.6098417Z scale_ub=None, 2025-05-07T20:32:34.6098627Z contiguous=True, 2025-05-07T20:32:34.6098850Z compiled=False, 2025-05-07T20:32:34.6099056Z ) 2025-05-07T20:32:34.6099375Z self = 2025-05-07T20:32:34.6099871Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.6100148Z 2025-05-07T20:32:34.6100227Z @given( 2025-05-07T20:32:34.6100453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6100766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6101069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6101392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6101715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6101998Z ) 2025-05-07T20:32:34.6102341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6102774Z def test_silu_mul_quant( 2025-05-07T20:32:34.6103024Z self, 2025-05-07T20:32:34.6103219Z T: int, 2025-05-07T20:32:34.6103416Z D: int, 2025-05-07T20:32:34.6103630Z scale_ub: Optional[float], 2025-05-07T20:32:34.6103903Z contiguous: bool, 2025-05-07T20:32:34.6104142Z compiled: bool, 2025-05-07T20:32:34.6104361Z ) -> None: 2025-05-07T20:32:34.6104575Z torch.manual_seed(2025) 2025-05-07T20:32:34.6104819Z 2025-05-07T20:32:34.6105084Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6107098Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6108924Z 2025-05-07T20:32:34.6109043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6109252Z 2025-05-07T20:32:34.6109359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6109768Z self=, 2025-05-07T20:32:34.6110164Z T=16384, 2025-05-07T20:32:34.6110362Z D=7168, 2025-05-07T20:32:34.6110550Z scale_ub=1200.0, 2025-05-07T20:32:34.6110772Z contiguous=True, 2025-05-07T20:32:34.6110993Z compiled=False, 2025-05-07T20:32:34.6111195Z ) 2025-05-07T20:32:34.6111528Z self = 2025-05-07T20:32:34.6112016Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.6112298Z 2025-05-07T20:32:34.6112378Z @given( 2025-05-07T20:32:34.6112621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6112928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6113229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6113554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6113878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6114165Z ) 2025-05-07T20:32:34.6114563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6114995Z def test_silu_mul_quant( 2025-05-07T20:32:34.6115334Z self, 2025-05-07T20:32:34.6115529Z T: int, 2025-05-07T20:32:34.6115724Z D: int, 2025-05-07T20:32:34.6115939Z scale_ub: Optional[float], 2025-05-07T20:32:34.6116214Z contiguous: bool, 2025-05-07T20:32:34.6116451Z compiled: bool, 2025-05-07T20:32:34.6116671Z ) -> None: 2025-05-07T20:32:34.6116883Z torch.manual_seed(2025) 2025-05-07T20:32:34.6117120Z 2025-05-07T20:32:34.6117427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6119432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
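Editor's note: with only ~26 MiB free of 22.07 GiB, even small examples fail once earlier ones have filled the device. One hedged mitigation, not part of activation_test.py as shipped, is to release cached CUDA blocks between Hypothesis examples:

```python
# Hypothetical tearDown for the test class: drop Python references from the
# previous example, then return cached blocks to the driver so the next
# example starts from a cleaner allocator state. This helps only if the
# earlier allocations are actually unreferenced by then.
import gc
import unittest

import torch


class ActivationTests(unittest.TestCase):
    def tearDown(self) -> None:
        gc.collect()
        torch.cuda.empty_cache()
```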
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6121254Z 2025-05-07T20:32:34.6121371Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6121579Z 2025-05-07T20:32:34.6121684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6122088Z self=, 2025-05-07T20:32:34.6122490Z T=128, 2025-05-07T20:32:34.6122685Z D=5120, 2025-05-07T20:32:34.6122883Z scale_ub=1200.0, 2025-05-07T20:32:34.6123109Z contiguous=False, 2025-05-07T20:32:34.6123335Z compiled=False, 2025-05-07T20:32:34.6123550Z ) 2025-05-07T20:32:34.7408137Z self = 2025-05-07T20:32:34.7409162Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.7409729Z 2025-05-07T20:32:34.7409889Z @given( 2025-05-07T20:32:34.7410353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7410957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7411557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7412205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7412839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7413521Z ) 2025-05-07T20:32:34.7414202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7415066Z def test_silu_mul_quant( 2025-05-07T20:32:34.7415351Z self, 2025-05-07T20:32:34.7415581Z T: int, 2025-05-07T20:32:34.7415792Z D: int, 2025-05-07T20:32:34.7416012Z scale_ub: Optional[float], 2025-05-07T20:32:34.7416284Z contiguous: bool, 2025-05-07T20:32:34.7416524Z compiled: bool, 2025-05-07T20:32:34.7416745Z ) -> None: 2025-05-07T20:32:34.7416960Z torch.manual_seed(2025) 2025-05-07T20:32:34.7417201Z 2025-05-07T20:32:34.7417472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7417816Z 2025-05-07T20:32:34.7418015Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7418299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7418615Z x = x_sign * x_clamp 2025-05-07T20:32:34.7418866Z x0 = x[:, :D] 2025-05-07T20:32:34.7419079Z x1 = x[:, D:] 2025-05-07T20:32:34.7419298Z 2025-05-07T20:32:34.7419495Z if contiguous: 2025-05-07T20:32:34.7419728Z x0 = x0.contiguous() 2025-05-07T20:32:34.7419986Z x1 = x1.contiguous() 2025-05-07T20:32:34.7420230Z 2025-05-07T20:32:34.7420430Z if scale_ub is not None: 2025-05-07T20:32:34.7420695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7421033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7421445Z ) 2025-05-07T20:32:34.7421635Z else: 2025-05-07T20:32:34.7421847Z scale_ub_tensor = None 2025-05-07T20:32:34.7422208Z 2025-05-07T20:32:34.7422445Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7422761Z op = silu_mul_quant 2025-05-07T20:32:34.7423010Z if compiled: 2025-05-07T20:32:34.7423256Z op = torch.compile(op) 2025-05-07T20:32:34.7423553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7423830Z 2025-05-07T20:32:34.7424081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7424248Z 2025-05-07T20:32:34.7424348Z moe/activation_test.py:117: 2025-05-07T20:32:34.7424648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7424982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7425260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7426101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7426932Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7427553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7428356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7429068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7429600Z kernel = self.compile( 2025-05-07T20:32:34.7430143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7430790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7431186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7431414Z 2025-05-07T20:32:34.7431632Z self = 2025-05-07T20:32:34.7432704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7434057Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016850cc0>} 2025-05-07T20:32:34.7435410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7436457Z context = 2025-05-07T20:32:34.7436744Z 2025-05-07T20:32:34.7436916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7437443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7437915Z module_map=module_map) 2025-05-07T20:32:34.7438279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7438629Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7438896Z E ^ 2025-05-07T20:32:34.7439360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7439804Z 2025-05-07T20:32:34.7440221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7440726Z 2025-05-07T20:32:34.7440832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7441244Z self=, 2025-05-07T20:32:34.7441651Z T=2048, 2025-05-07T20:32:34.7441840Z D=7168, 2025-05-07T20:32:34.7442082Z scale_ub=None, 2025-05-07T20:32:34.7442297Z contiguous=False, 2025-05-07T20:32:34.7442523Z compiled=False, 2025-05-07T20:32:34.7442807Z ) 2025-05-07T20:32:34.7443127Z self = 2025-05-07T20:32:34.7443616Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.7443889Z 2025-05-07T20:32:34.7443967Z @given( 2025-05-07T20:32:34.7444203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7444521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7444869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7445200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7445566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7445876Z ) 2025-05-07T20:32:34.7446230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7446681Z def test_silu_mul_quant( 2025-05-07T20:32:34.7446929Z self, 2025-05-07T20:32:34.7447132Z T: int, 2025-05-07T20:32:34.7447340Z D: int, 2025-05-07T20:32:34.7447568Z scale_ub: Optional[float], 2025-05-07T20:32:34.7447848Z contiguous: bool, 2025-05-07T20:32:34.7448094Z compiled: bool, 2025-05-07T20:32:34.7448319Z ) -> None: 2025-05-07T20:32:34.7448539Z torch.manual_seed(2025) 2025-05-07T20:32:34.7448788Z 2025-05-07T20:32:34.7449068Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7451098Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
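Editor's note: the CompilationError above is an architecture limit rather than a bug in the kernel call: the error text itself lists ('fp8e4b15', 'fp8e5') as the only fp8 dtypes this GPU supports, and Triton's fp8e4nv (E4M3) path generally requires a newer compute capability. A sketch of a skip guard, where the (8, 9) threshold (Ada/Hopper) is an assumption inferred from the error, not a constant taken from FBGEMM or Triton:

```python
# Hypothetical guard: skip fp8e4nv tests on GPUs that cannot compile them.
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # (8, 9) is an assumed minimum compute capability for fp8e4nv.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    ...
```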
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7452937Z 2025-05-07T20:32:34.7453123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.7453337Z 2025-05-07T20:32:34.7453440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7453853Z self=, 2025-05-07T20:32:34.7454250Z T=128, 2025-05-07T20:32:34.7454437Z D=7168, 2025-05-07T20:32:34.7454630Z scale_ub=1200.0, 2025-05-07T20:32:34.7454858Z contiguous=True, 2025-05-07T20:32:34.7455076Z compiled=True, 2025-05-07T20:32:34.7455284Z ) 2025-05-07T20:32:34.7766812Z self = 2025-05-07T20:32:34.7767341Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.7767604Z 2025-05-07T20:32:34.7767685Z @given( 2025-05-07T20:32:34.7767924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7768243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7768551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7768886Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7769219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7769509Z ) 2025-05-07T20:32:34.7769856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7770306Z def test_silu_mul_quant( 2025-05-07T20:32:34.7770556Z self, 2025-05-07T20:32:34.7770753Z T: int, 2025-05-07T20:32:34.7770957Z D: int, 2025-05-07T20:32:34.7771181Z scale_ub: Optional[float], 2025-05-07T20:32:34.7771450Z contiguous: bool, 2025-05-07T20:32:34.7771692Z compiled: bool, 2025-05-07T20:32:34.7771922Z ) -> None: 2025-05-07T20:32:34.7772136Z torch.manual_seed(2025) 2025-05-07T20:32:34.7772487Z 2025-05-07T20:32:34.7772765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7773176Z 2025-05-07T20:32:34.7773492Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7773789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7774092Z x = x_sign * x_clamp 2025-05-07T20:32:34.7774334Z x0 = x[:, :D] 2025-05-07T20:32:34.7774558Z x1 = x[:, D:] 2025-05-07T20:32:34.7774771Z 2025-05-07T20:32:34.7774959Z if contiguous: 2025-05-07T20:32:34.7775203Z x0 = x0.contiguous() 2025-05-07T20:32:34.7775525Z x1 = x1.contiguous() 2025-05-07T20:32:34.7775763Z 2025-05-07T20:32:34.7775960Z if scale_ub is not None: 2025-05-07T20:32:34.7776234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7776564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7776880Z ) 2025-05-07T20:32:34.7777078Z else: 2025-05-07T20:32:34.7777292Z scale_ub_tensor = None 2025-05-07T20:32:34.7777549Z 2025-05-07T20:32:34.7777785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7778105Z op = silu_mul_quant 2025-05-07T20:32:34.7778360Z if compiled: 2025-05-07T20:32:34.7778614Z op = torch.compile(op) 2025-05-07T20:32:34.7778909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7779190Z 2025-05-07T20:32:34.7779391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7779556Z 2025-05-07T20:32:34.7779660Z moe/activation_test.py:117: 2025-05-07T20:32:34.7788775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7789126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7789404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7789956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.7790511Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.7791168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7791840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7792369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7793030Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7793682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7794201Z kernel = self.compile( 2025-05-07T20:32:34.7794734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7795371Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7795760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7795986Z 2025-05-07T20:32:34.7796191Z self = 2025-05-07T20:32:34.7797256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7798602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016851a80>} 2025-05-07T20:32:34.7799914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7800909Z context = 2025-05-07T20:32:34.7801196Z 2025-05-07T20:32:34.7801454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7802037Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7802496Z module_map=module_map) 2025-05-07T20:32:34.7802853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7803197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7803448Z E ^ 2025-05-07T20:32:34.7803903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7804389Z 2025-05-07T20:32:34.7804798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7805304Z 2025-05-07T20:32:34.7805431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7805869Z self=, 2025-05-07T20:32:34.7806269Z T=128, 2025-05-07T20:32:34.7806457Z D=7168, 2025-05-07T20:32:34.7806647Z scale_ub=1200.0, 2025-05-07T20:32:34.7806860Z contiguous=True, 2025-05-07T20:32:34.7807082Z compiled=False, 2025-05-07T20:32:34.7807286Z ) 2025-05-07T20:32:34.7807592Z self = 2025-05-07T20:32:34.7808073Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.7808339Z 2025-05-07T20:32:34.7808413Z @given( 2025-05-07T20:32:34.7808636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7808940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7809237Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7809556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7809872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7810152Z ) 2025-05-07T20:32:34.7810493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7810920Z def test_silu_mul_quant( 2025-05-07T20:32:34.7811154Z self, 2025-05-07T20:32:34.7811350Z T: int, 2025-05-07T20:32:34.7811559Z D: int, 2025-05-07T20:32:34.7811771Z scale_ub: Optional[float], 2025-05-07T20:32:34.7812036Z contiguous: bool, 2025-05-07T20:32:34.7812272Z compiled: bool, 2025-05-07T20:32:34.7812489Z ) -> None: 2025-05-07T20:32:34.7812697Z torch.manual_seed(2025) 2025-05-07T20:32:34.7812930Z 2025-05-07T20:32:34.7813246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7813582Z 2025-05-07T20:32:34.7813773Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7814049Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7816082Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7817897Z 2025-05-07T20:32:34.7818012Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.7818222Z 2025-05-07T20:32:34.7818320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7818723Z self=, 2025-05-07T20:32:34.7819103Z T=128, 2025-05-07T20:32:34.7819288Z D=5120, 2025-05-07T20:32:34.7819477Z scale_ub=1200.0, 2025-05-07T20:32:34.7819689Z contiguous=True, 2025-05-07T20:32:34.7819903Z compiled=True, 2025-05-07T20:32:34.7820099Z ) 2025-05-07T20:32:34.7820409Z self = 2025-05-07T20:32:34.7820928Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.7821266Z 2025-05-07T20:32:34.7821344Z @given( 2025-05-07T20:32:34.7821566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7821863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7822160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7822479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7822794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7823114Z ) 2025-05-07T20:32:34.7823451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7823878Z def test_silu_mul_quant( 2025-05-07T20:32:34.7824106Z self, 2025-05-07T20:32:34.7824297Z T: int, 2025-05-07T20:32:34.7824483Z D: int, 2025-05-07T20:32:34.7824689Z scale_ub: Optional[float], 2025-05-07T20:32:34.7824954Z contiguous: bool, 2025-05-07T20:32:34.7825188Z compiled: bool, 2025-05-07T20:32:34.7825401Z ) -> None: 2025-05-07T20:32:34.7825616Z torch.manual_seed(2025) 2025-05-07T20:32:34.7825853Z 2025-05-07T20:32:34.7826113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7826445Z 2025-05-07T20:32:34.7826633Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7826914Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7828859Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
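Editor's note: for what the test is actually computing, the ref_fn shown later in this log is SiLU(x0) * x1 in fp32 followed by row-wise FP8 quantization. A self-contained version of the unquantized part, mirroring the log's own ref_fn minus the triton_quantize_fp8_row step (the step that trips the fp8e4nv compile error on this runner):

```python
# Unquantized reference for the op under test: SiLU(x0) * x1 in fp32.
import torch


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
```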
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7830676Z 2025-05-07T20:32:34.7830797Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.7831007Z 2025-05-07T20:32:34.7831107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7831507Z self=, 2025-05-07T20:32:34.7831891Z T=128, 2025-05-07T20:32:34.7832070Z D=7168, 2025-05-07T20:32:34.7832253Z scale_ub=None, 2025-05-07T20:32:34.7832455Z contiguous=True, 2025-05-07T20:32:34.7832674Z compiled=True, 2025-05-07T20:32:34.7832871Z ) 2025-05-07T20:32:35.2583968Z self = 2025-05-07T20:32:35.2584480Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2584749Z 2025-05-07T20:32:35.2584828Z @given( 2025-05-07T20:32:35.2585060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2585388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2585727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2586054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2586368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2586650Z ) 2025-05-07T20:32:35.2586994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2587426Z def test_silu_mul_quant( 2025-05-07T20:32:35.2587657Z self, 2025-05-07T20:32:35.2587852Z T: int, 2025-05-07T20:32:35.2588050Z D: int, 2025-05-07T20:32:35.2588260Z scale_ub: Optional[float], 2025-05-07T20:32:35.2588526Z contiguous: bool, 2025-05-07T20:32:35.2588761Z compiled: bool, 2025-05-07T20:32:35.2588982Z ) -> None: 2025-05-07T20:32:35.2589194Z torch.manual_seed(2025) 2025-05-07T20:32:35.2589433Z 2025-05-07T20:32:35.2589698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2591993Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2593872Z 2025-05-07T20:32:35.2593992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.2594204Z 2025-05-07T20:32:35.2629659Z FAILED 2025-05-07T20:32:35.2630034Z 2025-05-07T20:32:35.2630496Z =================================== FAILURES =================================== 2025-05-07T20:32:35.2631152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.2631779Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.2632649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:35.2633400Z | yield 2025-05-07T20:32:35.2633993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:35.2634689Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.2635457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:35.2636224Z | if method() is not None: 2025-05-07T20:32:35.2636577Z | ^^^^^^^^ 2025-05-07T20:32:35.2637433Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.2638432Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2638850Z | ^^^^^^^ 2025-05-07T20:32:35.2639617Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.2640477Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.2641064Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.2641645Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.2642034Z | Traceback (most recent call last): 2025-05-07T20:32:35.2643019Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2644081Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2644586Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2647355Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2650122Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2650715Z | self=, 2025-05-07T20:32:35.2651280Z | T=2048, 2025-05-07T20:32:35.2651604Z | D=5120, # or any other generated value 2025-05-07T20:32:35.2652058Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.2652564Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.2653327Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.2653740Z | ) 2025-05-07T20:32:35.2653991Z | 2025-05-07T20:32:35.2654914Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.2655754Z +---------------- 2 ---------------- 2025-05-07T20:32:35.2656149Z | Traceback (most recent call last): 2025-05-07T20:32:35.2657123Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2658258Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2658780Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2661717Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2664434Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2665033Z | self=, 2025-05-07T20:32:35.2665630Z | T=128, 2025-05-07T20:32:35.2665940Z | D=7168, 2025-05-07T20:32:35.2666239Z | scale_ub=None, 2025-05-07T20:32:35.2666496Z | contiguous=True, 2025-05-07T20:32:35.2666766Z | compiled=True, 2025-05-07T20:32:35.2667004Z | ) 2025-05-07T20:32:35.2667204Z | 2025-05-07T20:32:35.2667739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2668345Z +---------------- 3 ---------------- 2025-05-07T20:32:35.2668648Z | Traceback (most recent call last): 2025-05-07T20:32:35.2670070Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2670858Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2671235Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2673423Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
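Editor's note: each sub-failure above comes with a @reproduce_failure hint, and adding it temporarily makes Hypothesis replay exactly that falsifying example. A sketch for failure 1, shown standalone (the real test is a method taking self, and its body is elided here):

```python
# Replaying failure 1 locally: the version string and payload are copied
# verbatim from the hint above. The decorator is meant to be temporary;
# remove it once the bug is fixed.
from hypothesis import given, reproduce_failure, settings, strategies as st


@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # body elided; Hypothesis pins all draws to the failing example
```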
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2675512Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2675969Z | self=, 2025-05-07T20:32:35.2676388Z | T=128, 2025-05-07T20:32:35.2676600Z | D=5120, 2025-05-07T20:32:35.2676827Z | scale_ub=1200.0, 2025-05-07T20:32:35.2677085Z | contiguous=True, 2025-05-07T20:32:35.2677339Z | compiled=True, 2025-05-07T20:32:35.2677586Z | ) 2025-05-07T20:32:35.2677784Z | 2025-05-07T20:32:35.2678312Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2678925Z +---------------- 4 ---------------- 2025-05-07T20:32:35.2679228Z | Traceback (most recent call last): 2025-05-07T20:32:35.2680191Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:35.2680918Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2681219Z | ^^^^^^^^ 2025-05-07T20:32:35.2681867Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:35.2682559Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2682984Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2683787Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:35.2684588Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2685204Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:35.2685949Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2686528Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2687385Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:35.2688437Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2689080Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2689949Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:35.2690899Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2691424Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2692259Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:35.2693127Z | fn() 2025-05-07T20:32:35.2693914Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:35.2694785Z | self.fn.run( 2025-05-07T20:32:35.2695524Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:35.2696322Z | kernel = self.compile( 2025-05-07T20:32:35.2696691Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:35.2697512Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:35.2698493Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2699024Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2699928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.2701011Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2701673Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2834115Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2834643Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2835013Z | ^ 2025-05-07T20:32:35.2835651Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2836424Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2837319Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:35.2838212Z | self=, 2025-05-07T20:32:35.2838800Z | T=1, # or any other generated value 2025-05-07T20:32:35.2839232Z | D=5120, # or any other generated value 2025-05-07T20:32:35.2839695Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.2840185Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.2840684Z | compiled=True, # or any other generated value 2025-05-07T20:32:35.2841227Z | ) 2025-05-07T20:32:35.2841421Z | 2025-05-07T20:32:35.2841955Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2842560Z +------------------------------------ 2025-05-07T20:32:35.2842928Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:35.2843297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2843715Z self=, 2025-05-07T20:32:35.2844121Z T=1, 2025-05-07T20:32:35.2844299Z D=5120, 2025-05-07T20:32:35.2844488Z scale_ub=None, 2025-05-07T20:32:35.2844704Z contiguous=True, 2025-05-07T20:32:35.2844918Z compiled=True, 2025-05-07T20:32:35.2845128Z ) 2025-05-07T20:32:35.2845452Z self = 2025-05-07T20:32:35.2845969Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2846240Z 2025-05-07T20:32:35.2846316Z @given( 2025-05-07T20:32:35.2846545Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2846856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2847153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2847477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2847806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2848086Z ) 2025-05-07T20:32:35.2848436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2848877Z def test_silu_mul_quant( 2025-05-07T20:32:35.2849113Z self, 2025-05-07T20:32:35.2849308Z T: int, 2025-05-07T20:32:35.2849509Z D: int, 2025-05-07T20:32:35.2849726Z scale_ub: Optional[float], 2025-05-07T20:32:35.2849997Z contiguous: bool, 2025-05-07T20:32:35.2850242Z compiled: bool, 2025-05-07T20:32:35.2850469Z ) -> None: 2025-05-07T20:32:35.2850677Z torch.manual_seed(2025) 2025-05-07T20:32:35.2850918Z 2025-05-07T20:32:35.2851194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2851529Z 2025-05-07T20:32:35.2851725Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2852015Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2852320Z x = x_sign * x_clamp 2025-05-07T20:32:35.2852563Z x0 = x[:, :D] 2025-05-07T20:32:35.2852780Z x1 = x[:, D:] 2025-05-07T20:32:35.2853094Z 2025-05-07T20:32:35.2853299Z if contiguous: 2025-05-07T20:32:35.2853529Z x0 = x0.contiguous() 2025-05-07T20:32:35.2853781Z x1 = x1.contiguous() 2025-05-07T20:32:35.2854021Z 2025-05-07T20:32:35.2854212Z if scale_ub is not None: 2025-05-07T20:32:35.2854477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2854814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2855124Z ) 2025-05-07T20:32:35.2855310Z else: 2025-05-07T20:32:35.2855535Z scale_ub_tensor = None 2025-05-07T20:32:35.2855816Z 2025-05-07T20:32:35.2856046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2856351Z op = silu_mul_quant 2025-05-07T20:32:35.2856606Z if compiled: 2025-05-07T20:32:35.2856914Z op = torch.compile(op) 2025-05-07T20:32:35.2857199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2857472Z 2025-05-07T20:32:35.2857747Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.2858030Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.2858318Z 2025-05-07T20:32:35.2858556Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2858881Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.2859172Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.2859876Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.2860233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2860535Z 2025-05-07T20:32:35.2860738Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2860931Z 2025-05-07T20:32:35.2861036Z moe/activation_test.py:126: 2025-05-07T20:32:35.2861325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2878768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.2879122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2879897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.2880642Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2881182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2881868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2882543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.2883255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2883970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.2884609Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2885200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2885754Z fn() 2025-05-07T20:32:35.2886252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2886819Z self.fn.run( 2025-05-07T20:32:35.2887282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2887807Z kernel = self.compile( 2025-05-07T20:32:35.2888333Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2888982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2889368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2889598Z 2025-05-07T20:32:35.2889812Z self = 2025-05-07T20:32:35.2890866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2892230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1098d4c20>} 2025-05-07T20:32:35.2893641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2894650Z context = 2025-05-07T20:32:35.2895064Z 2025-05-07T20:32:35.2895229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2895860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2896323Z module_map=module_map) 2025-05-07T20:32:35.2896684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2897033Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2897299Z E ^ 2025-05-07T20:32:35.2897749Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2898250Z 2025-05-07T20:32:35.2898659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2899156Z 2025-05-07T20:32:35.2899258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2899662Z self=, 2025-05-07T20:32:35.2900055Z T=2048, 2025-05-07T20:32:35.2900243Z D=5120, 2025-05-07T20:32:35.2900430Z scale_ub=1200.0, 2025-05-07T20:32:35.2900647Z contiguous=True, 2025-05-07T20:32:35.2900864Z compiled=False, 2025-05-07T20:32:35.2901062Z ) 2025-05-07T20:32:35.2901377Z self = 2025-05-07T20:32:35.2901861Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.2902124Z 2025-05-07T20:32:35.2902199Z @given( 2025-05-07T20:32:35.2902425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2902735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2903031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2903359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2903685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2903965Z ) 2025-05-07T20:32:35.2904315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2904757Z def test_silu_mul_quant( 2025-05-07T20:32:35.2905006Z self, 2025-05-07T20:32:35.2905198Z T: int, 2025-05-07T20:32:35.2905405Z D: int, 2025-05-07T20:32:35.2905667Z scale_ub: Optional[float], 2025-05-07T20:32:35.2905942Z contiguous: bool, 2025-05-07T20:32:35.2906192Z compiled: bool, 2025-05-07T20:32:35.2906419Z ) -> None: 2025-05-07T20:32:35.2906620Z torch.manual_seed(2025) 2025-05-07T20:32:35.2906853Z 2025-05-07T20:32:35.2907120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2907458Z 2025-05-07T20:32:35.2907651Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2907942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2908248Z x = x_sign * x_clamp 2025-05-07T20:32:35.2908488Z x0 = x[:, :D] 
2025-05-07T20:32:35.2908704Z x1 = x[:, D:] 2025-05-07T20:32:35.2908914Z 2025-05-07T20:32:35.2909104Z if contiguous: 2025-05-07T20:32:35.2909340Z x0 = x0.contiguous() 2025-05-07T20:32:35.2909600Z x1 = x1.contiguous() 2025-05-07T20:32:35.2909835Z 2025-05-07T20:32:35.2910028Z if scale_ub is not None: 2025-05-07T20:32:35.2910298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2910626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2910935Z ) 2025-05-07T20:32:35.2911128Z else: 2025-05-07T20:32:35.2911333Z scale_ub_tensor = None 2025-05-07T20:32:35.2911588Z 2025-05-07T20:32:35.2911820Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2912128Z op = silu_mul_quant 2025-05-07T20:32:35.2912378Z if compiled: 2025-05-07T20:32:35.2912628Z op = torch.compile(op) 2025-05-07T20:32:35.2912913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2913241Z 2025-05-07T20:32:35.2913438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2913599Z 2025-05-07T20:32:35.2913710Z moe/activation_test.py:117: 2025-05-07T20:32:35.2914074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2914402Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2914681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2915356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2916082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2916611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2917282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2917931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2918462Z kernel = self.compile( 2025-05-07T20:32:35.2919004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2919647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2920043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2920279Z 2025-05-07T20:32:35.2920484Z self = 2025-05-07T20:32:35.2921549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2922897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb109990180>} 2025-05-07T20:32:35.2924221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2925238Z context = 2025-05-07T20:32:35.2925521Z 2025-05-07T20:32:35.2925694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2926215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2926676Z module_map=module_map) 2025-05-07T20:32:35.2927033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2927382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2927632Z E ^ 2025-05-07T20:32:35.2928086Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2928530Z 2025-05-07T20:32:35.2928947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2929445Z 2025-05-07T20:32:35.2929554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2929954Z self=, 2025-05-07T20:32:35.2930346Z T=2048, 2025-05-07T20:32:35.2930529Z D=5120, 2025-05-07T20:32:35.2930714Z scale_ub=1200.0, 2025-05-07T20:32:35.2930933Z contiguous=True, 2025-05-07T20:32:35.2931150Z compiled=True, 2025-05-07T20:32:35.2931356Z ) 2025-05-07T20:32:35.2931671Z self = 2025-05-07T20:32:35.2932156Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2932418Z 2025-05-07T20:32:35.2932501Z @given( 2025-05-07T20:32:35.2932724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2933180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2933483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2933880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2934201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2934478Z ) 2025-05-07T20:32:35.2934813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2935245Z def test_silu_mul_quant( 2025-05-07T20:32:35.2935488Z self, 2025-05-07T20:32:35.2935705Z T: int, 2025-05-07T20:32:35.2935949Z D: int, 2025-05-07T20:32:35.2936162Z scale_ub: Optional[float], 2025-05-07T20:32:35.2936421Z contiguous: bool, 2025-05-07T20:32:35.2936653Z compiled: bool, 2025-05-07T20:32:35.2936871Z ) -> None: 2025-05-07T20:32:35.2937077Z torch.manual_seed(2025) 2025-05-07T20:32:35.2937317Z 2025-05-07T20:32:35.2937579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2937911Z 2025-05-07T20:32:35.2938098Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2938387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2938686Z x = x_sign * x_clamp 2025-05-07T20:32:35.2938916Z x0 = x[:, :D] 2025-05-07T20:32:35.2939127Z x1 = x[:, D:] 2025-05-07T20:32:35.2939326Z 2025-05-07T20:32:35.2939501Z if contiguous: 2025-05-07T20:32:35.2939724Z x0 = x0.contiguous() 2025-05-07T20:32:35.2939969Z x1 = x1.contiguous() 2025-05-07T20:32:35.2940198Z 2025-05-07T20:32:35.2940392Z if scale_ub is not None: 2025-05-07T20:32:35.2940655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2940974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2941273Z ) 2025-05-07T20:32:35.2941462Z else: 2025-05-07T20:32:35.2941661Z scale_ub_tensor = None 2025-05-07T20:32:35.2941906Z 2025-05-07T20:32:35.2942134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2942434Z op = silu_mul_quant 2025-05-07T20:32:35.2942673Z if compiled: 2025-05-07T20:32:35.2942911Z op = torch.compile(op) 2025-05-07T20:32:35.2943193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2943456Z 2025-05-07T20:32:35.2943641Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.2943918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.2944196Z 2025-05-07T20:32:35.2944427Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2944756Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.2945036Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.2945340Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.2945713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2946037Z 2025-05-07T20:32:35.2946228Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2946422Z 2025-05-07T20:32:35.2946519Z moe/activation_test.py:126: 2025-05-07T20:32:35.2946815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2947136Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.2947455Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2948219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.2948951Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2949485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2950144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2950808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.2951556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2952362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.2952984Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2953568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2954070Z fn() 2025-05-07T20:32:35.2954575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2955185Z self.fn.run( 2025-05-07T20:32:35.2955643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2956166Z kernel = self.compile( 2025-05-07T20:32:35.2956697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2957342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2957733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2957963Z 2025-05-07T20:32:35.2958166Z self = 2025-05-07T20:32:35.2959428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2960812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852d260>} 2025-05-07T20:32:35.2962122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2963137Z context = 2025-05-07T20:32:35.2963425Z 2025-05-07T20:32:35.2963588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2964098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2964551Z module_map=module_map) 2025-05-07T20:32:35.2964912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2965271Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2965559Z E ^ 2025-05-07T20:32:35.2966032Z E ValueError("type fp8e4nv not supported in this architecture. 
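The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Every failure in this run has the same root cause: both kernels ask Triton for the fp8e4nv element type (what torch.float8_e4m3fn lowers to), and Triton's NVIDIA backend accepts fp8e4nv only on GPUs with compute capability >= 8.9 (Ada/Hopper). On older parts it reports exactly the pair seen here, ('fp8e4b15', 'fp8e5'), so the GPU this job ran on is evidently below sm_89, and both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fail at compile time, before any launch. A minimal probe for this precondition (a sketch; the >= (8, 9) threshold is inferred from the error message above, not stated anywhere in this log):

```python
import torch

def fp8e4nv_supported() -> bool:
    # Triton emits fp8e4nv (float8_e4m3fn) only for NVIDIA GPUs with
    # compute capability >= 8.9; below that only fp8e4b15/fp8e5 exist,
    # which is exactly the ValueError this log keeps hitting.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

On a machine where this returns False, every Hypothesis example below is bound to raise the same CompilationError, which is the pattern the rest of this log shows.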
The remaining Hypothesis examples all hit the identical CompilationError; only the drawn parameters (and hex addresses in the reprs) differ, and the test body shown above is repeated verbatim each time. When compiled=False, the error is raised at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant (activation.py:80); when compiled=True, fn() evidently gets through under torch.compile and the error moves to y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row inside triton_quantize_fp8_row (fp8_gemm.py:2370).

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
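Rather than letting Hypothesis walk every drawn example into the same compile failure, a test with this hardware requirement is normally gated up front. A hypothetical guard (the marker name and its placement are illustrative assumptions, not FBGEMM's actual skip logic):

```python
import pytest
import torch

# Hypothetical marker; FBGEMM may gate these tests differently.
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv needs compute capability >= 8.9; "
    "this GPU only offers fp8e4b15/fp8e5",
)

@requires_fp8e4nv
def test_silu_mul_quant_fp8() -> None:
    ...  # same body as test_silu_mul_quant above
```

With a guard like this, the repeated "Trying example" blocks below would collapse into a single SKIPPED result instead of one CompilationError per drawn example.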
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
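For reference, the computation both failing paths implement is small enough to write in eager PyTorch: y = silu(x0) * x1 in fp32, then per-row quantization to float8_e4m3fn under the dequant convention the test checks (y_fp8.to(torch.float32) * y_scale[:, None]). The sketch below mirrors the test's ref_fn; the exact clamping and epsilon rules inside triton_quantize_fp8_row are assumptions:

```python
from typing import Optional, Tuple

import torch

def silu_mul_quant_eager(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Row-wise symmetric scale: row_max / fp8_max, optionally capped by
    # scale_ub (assumed semantics; triton_quantize_fp8_row may differ).
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    # Quantize so that y ~= y_fp8.to(torch.float32) * scale[:, None].
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale
```

The plain cast to torch.float8_e4m3fn does not go through Triton codegen, so an eager fallback like this should run even where the two kernels above cannot compile; only Triton's fp8e4nv lowering has the sm_89 floor.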
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3157733Z 2025-05-07T20:32:35.3158139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3158147Z 2025-05-07T20:32:35.3158251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3158468Z self=, 2025-05-07T20:32:35.3158550Z T=2048, 2025-05-07T20:32:35.3158624Z D=5120, 2025-05-07T20:32:35.3158702Z scale_ub=None, 2025-05-07T20:32:35.3158796Z contiguous=True, 2025-05-07T20:32:35.3158876Z compiled=True, 2025-05-07T20:32:35.3158949Z ) 2025-05-07T20:32:35.3159165Z self = 2025-05-07T20:32:35.3159557Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.3159565Z 2025-05-07T20:32:35.3159672Z @given( 2025-05-07T20:32:35.3159793Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3159891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3160010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3160122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3160233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3160308Z ) 2025-05-07T20:32:35.3160547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3160639Z def test_silu_mul_quant( 2025-05-07T20:32:35.3160718Z self, 2025-05-07T20:32:35.3160795Z T: int, 2025-05-07T20:32:35.3160868Z D: int, 2025-05-07T20:32:35.3160974Z scale_ub: Optional[float], 2025-05-07T20:32:35.3161061Z contiguous: bool, 2025-05-07T20:32:35.3161146Z compiled: bool, 2025-05-07T20:32:35.3161222Z ) -> None: 2025-05-07T20:32:35.3161312Z torch.manual_seed(2025) 2025-05-07T20:32:35.3161385Z 2025-05-07T20:32:35.3161632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3161704Z 2025-05-07T20:32:35.3161925Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3162050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3162137Z x = x_sign * x_clamp 2025-05-07T20:32:35.3162218Z x0 = x[:, :D] 2025-05-07T20:32:35.3162296Z x1 = x[:, D:] 2025-05-07T20:32:35.3162363Z 2025-05-07T20:32:35.3162445Z if contiguous: 2025-05-07T20:32:35.3162535Z x0 = x0.contiguous() 2025-05-07T20:32:35.3162627Z x1 = x1.contiguous() 2025-05-07T20:32:35.3162759Z 2025-05-07T20:32:35.3162846Z if scale_ub is not None: 2025-05-07T20:32:35.3162953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3163083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3163157Z ) 2025-05-07T20:32:35.3163233Z else: 2025-05-07T20:32:35.3163325Z scale_ub_tensor = None 2025-05-07T20:32:35.3163398Z 2025-05-07T20:32:35.3163531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3163623Z op = silu_mul_quant 2025-05-07T20:32:35.3163705Z if compiled: 2025-05-07T20:32:35.3163805Z op = torch.compile(op) 2025-05-07T20:32:35.3163904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3163971Z 2025-05-07T20:32:35.3164062Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3164177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3164254Z 2025-05-07T20:32:35.3164385Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3164483Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3164582Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3164698Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3164835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3164912Z 2025-05-07T20:32:35.3165010Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.3165014Z 2025-05-07T20:32:35.3165116Z moe/activation_test.py:126: 2025-05-07T20:32:35.3175262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3175451Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3175628Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3176216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3176327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3176689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3176910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3177281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3177544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3177924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3178101Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3178443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3178524Z fn() 2025-05-07T20:32:35.3178924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3179007Z self.fn.run( 2025-05-07T20:32:35.3179345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3179440Z kernel = self.compile( 2025-05-07T20:32:35.3179899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3180160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3180294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3180300Z 2025-05-07T20:32:35.3180510Z self = 2025-05-07T20:32:35.3181288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3181829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1097e8e00>} 2025-05-07T20:32:35.3182566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3182764Z context = 2025-05-07T20:32:35.3182769Z 2025-05-07T20:32:35.3182940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3183200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3183308Z module_map=module_map) 2025-05-07T20:32:35.3183479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3183584Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3183659Z E ^ 2025-05-07T20:32:35.3184014Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)
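Every example on this path dies inside triton_quantize_fp8_row at kernel-compile time, before any test assertion runs, so the failure reproduces without Hypothesis. A minimal repro sketch, assuming only the import path and call pattern visible in the traceback above:

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On an sm_86 GPU this raises triton.compiler.errors.CompilationError
    # ("type fp8e4nv not supported in this architecture").
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)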
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dccf40>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
        The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
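For reference, both failing paths compute the same thing: a SiLU-gated product followed by rowwise FP8 quantization. Below is a pure-PyTorch stand-in; the SiLU-mul part mirrors ref_fn in the test above, while the rowwise-scale semantics (scale = max(|row|)/FP8_MAX, optionally capped by scale_ub) are an assumption rather than taken from fp8_gemm.py:

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn        # assumed target dtype
    FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
        x0 = x0.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1.to(torch.float32)
        # Rowwise quantization: one fp32 scale per row (assumed semantics).
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_q = (y / scale).clamp(-FP8_MAX, FP8_MAX)
        return y_q.to(FP8_DTYPE), scale.squeeze(-1)

Dequantizing as the test does (y_fp8.to(torch.float32) * y_scale[:, None]) recovers y up to FP8 rounding.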
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3270082Z 2025-05-07T20:32:35.3270495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3270573Z 2025-05-07T20:32:35.3270678Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3270900Z self=, 2025-05-07T20:32:35.3270984Z T=1, 2025-05-07T20:32:35.3271064Z D=5120, 2025-05-07T20:32:35.3271144Z scale_ub=None, 2025-05-07T20:32:35.3271236Z contiguous=False, 2025-05-07T20:32:35.3271316Z compiled=True, 2025-05-07T20:32:35.3271437Z ) 2025-05-07T20:32:35.3271652Z self = 2025-05-07T20:32:35.3271813Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3271818Z 2025-05-07T20:32:35.3271895Z @given( 2025-05-07T20:32:35.3272013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3272111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3272230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3272348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3272459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3272540Z ) 2025-05-07T20:32:35.3272782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3272878Z def test_silu_mul_quant( 2025-05-07T20:32:35.3272953Z self, 2025-05-07T20:32:35.3273028Z T: int, 2025-05-07T20:32:35.3273110Z D: int, 2025-05-07T20:32:35.3273211Z scale_ub: Optional[float], 2025-05-07T20:32:35.3273299Z contiguous: bool, 2025-05-07T20:32:35.3273388Z compiled: bool, 2025-05-07T20:32:35.3273466Z ) -> None: 2025-05-07T20:32:35.3273557Z torch.manual_seed(2025) 2025-05-07T20:32:35.3273631Z 2025-05-07T20:32:35.3273798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3273873Z 2025-05-07T20:32:35.3273969Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3274091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3274191Z x = x_sign * x_clamp 2025-05-07T20:32:35.3274269Z x0 = x[:, :D] 2025-05-07T20:32:35.3274345Z x1 = x[:, D:] 2025-05-07T20:32:35.3274425Z 2025-05-07T20:32:35.3274509Z if contiguous: 2025-05-07T20:32:35.3274601Z x0 = x0.contiguous() 2025-05-07T20:32:35.3274693Z x1 = x1.contiguous() 2025-05-07T20:32:35.3274767Z 2025-05-07T20:32:35.3274856Z if scale_ub is not None: 2025-05-07T20:32:35.3274970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3275100Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3275173Z ) 2025-05-07T20:32:35.3275255Z else: 2025-05-07T20:32:35.3275352Z scale_ub_tensor = None 2025-05-07T20:32:35.3275430Z 2025-05-07T20:32:35.3275558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3275650Z op = silu_mul_quant 2025-05-07T20:32:35.3275739Z if compiled: 2025-05-07T20:32:35.3275842Z op = torch.compile(op) 2025-05-07T20:32:35.3275950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3276026Z 2025-05-07T20:32:35.3276116Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3276235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3276313Z 2025-05-07T20:32:35.3276453Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3276558Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3276662Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3276783Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3276925Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3277007Z 2025-05-07T20:32:35.3277107Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.3277160Z 2025-05-07T20:32:35.3277265Z moe/activation_test.py:126: 2025-05-07T20:32:35.3277466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3277572Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3277712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3278262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3278365Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3278757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3278976Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3279340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3279595Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3279968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3280143Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3280480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3280562Z fn() 2025-05-07T20:32:35.3280957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3281041Z self.fn.run( 2025-05-07T20:32:35.3281380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3281475Z kernel = self.compile( 2025-05-07T20:32:35.3281855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3282031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3282162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3282167Z 2025-05-07T20:32:35.3282377Z self = 2025-05-07T20:32:35.3283135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3283644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dcef20>} 2025-05-07T20:32:35.3284373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3284563Z context = 2025-05-07T20:32:35.3284570Z 2025-05-07T20:32:35.3284739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3285000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3285110Z module_map=module_map) 2025-05-07T20:32:35.3285271Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3285375Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3285462Z E ^ 2025-05-07T20:32:35.3285858Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3285862Z 2025-05-07T20:32:35.3286269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3286324Z 2025-05-07T20:32:35.3286427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3286723Z self=, 2025-05-07T20:32:35.3286805Z T=1, 2025-05-07T20:32:35.3286880Z D=5120, 2025-05-07T20:32:35.3286959Z scale_ub=None, 2025-05-07T20:32:35.3287049Z contiguous=True, 2025-05-07T20:32:35.3287133Z compiled=False, 2025-05-07T20:32:35.3287204Z ) 2025-05-07T20:32:35.3287423Z self = 2025-05-07T20:32:35.3287585Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3287629Z 2025-05-07T20:32:35.3287710Z @given( 2025-05-07T20:32:35.3287827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3287927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3288047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3288162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3288277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3288358Z ) 2025-05-07T20:32:35.3288602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3288693Z def test_silu_mul_quant( 2025-05-07T20:32:35.3288777Z self, 2025-05-07T20:32:35.3288852Z T: int, 2025-05-07T20:32:35.3288926Z D: int, 2025-05-07T20:32:35.3289030Z scale_ub: Optional[float], 2025-05-07T20:32:35.3289119Z contiguous: bool, 2025-05-07T20:32:35.3289211Z compiled: bool, 2025-05-07T20:32:35.3289291Z ) -> None: 2025-05-07T20:32:35.3289384Z torch.manual_seed(2025) 2025-05-07T20:32:35.3289461Z 2025-05-07T20:32:35.3289627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3289703Z 2025-05-07T20:32:35.3289801Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3289925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3290016Z x = x_sign * x_clamp 2025-05-07T20:32:35.3290102Z x0 = x[:, :D] 2025-05-07T20:32:35.3290181Z x1 = x[:, D:] 2025-05-07T20:32:35.3290259Z 2025-05-07T20:32:35.3290348Z if contiguous: 2025-05-07T20:32:35.3290437Z x0 = x0.contiguous() 2025-05-07T20:32:35.3290529Z x1 = x1.contiguous() 2025-05-07T20:32:35.3290604Z 2025-05-07T20:32:35.3290693Z if scale_ub is not None: 2025-05-07T20:32:35.3290799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3290932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3291011Z ) 2025-05-07T20:32:35.3291091Z else: 2025-05-07T20:32:35.3291186Z scale_ub_tensor = None 2025-05-07T20:32:35.3291259Z 2025-05-07T20:32:35.3291395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3291485Z op = silu_mul_quant 2025-05-07T20:32:35.3291568Z if compiled: 2025-05-07T20:32:35.3291676Z op = torch.compile(op) 2025-05-07T20:32:35.3291781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3291865Z 2025-05-07T20:32:35.3291960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3291965Z 2025-05-07T20:32:35.3292063Z moe/activation_test.py:117: 2025-05-07T20:32:35.3292196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3292295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3292395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3292893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3293046Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3293408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3293628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3294036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3294210Z kernel = self.compile( 2025-05-07T20:32:35.3294588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3294761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3294893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3294897Z 2025-05-07T20:32:35.3295140Z self = 2025-05-07T20:32:35.3295953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3296448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dcfa60>} 2025-05-07T20:32:35.3297190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3297383Z context = 2025-05-07T20:32:35.3297387Z 2025-05-07T20:32:35.3297551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3297815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3297921Z module_map=module_map) 2025-05-07T20:32:35.3298081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3298186Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3298264Z E ^ 2025-05-07T20:32:35.3298618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3298626Z 2025-05-07T20:32:35.3299032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3299036Z 2025-05-07T20:32:35.3299137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3299361Z self=, 2025-05-07T20:32:35.3299439Z T=128, 2025-05-07T20:32:35.3299523Z D=5120, 2025-05-07T20:32:35.3299606Z scale_ub=None, 2025-05-07T20:32:35.3299694Z contiguous=False, 2025-05-07T20:32:35.3299783Z compiled=True, 2025-05-07T20:32:35.3299856Z ) 2025-05-07T20:32:35.3300070Z self = 2025-05-07T20:32:35.3300244Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3300251Z 2025-05-07T20:32:35.3300329Z @given( 2025-05-07T20:32:35.3300449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3300557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3300670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3300792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3300903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3300979Z ) 2025-05-07T20:32:35.3301225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3301321Z def test_silu_mul_quant( 2025-05-07T20:32:35.3301397Z self, 2025-05-07T20:32:35.3301479Z T: int, 2025-05-07T20:32:35.3301556Z D: int, 2025-05-07T20:32:35.3301652Z scale_ub: Optional[float], 2025-05-07T20:32:35.3301747Z contiguous: bool, 2025-05-07T20:32:35.3301832Z compiled: bool, 2025-05-07T20:32:35.3301907Z ) -> None: 2025-05-07T20:32:35.3302054Z torch.manual_seed(2025) 2025-05-07T20:32:35.3302125Z 2025-05-07T20:32:35.3302292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3302445Z 2025-05-07T20:32:35.3302539Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3302667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3302754Z x = x_sign * x_clamp 2025-05-07T20:32:35.3302831Z x0 = x[:, :D] 2025-05-07T20:32:35.3302917Z x1 = x[:, D:] 2025-05-07T20:32:35.3302989Z 2025-05-07T20:32:35.3303072Z if contiguous: 2025-05-07T20:32:35.3303209Z x0 = x0.contiguous() 2025-05-07T20:32:35.3303298Z x1 = x1.contiguous() 2025-05-07T20:32:35.3303372Z 2025-05-07T20:32:35.3303469Z if scale_ub is not None: 2025-05-07T20:32:35.3303574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3303706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3303790Z ) 2025-05-07T20:32:35.3303869Z else: 2025-05-07T20:32:35.3303970Z scale_ub_tensor = None 2025-05-07T20:32:35.3304043Z 2025-05-07T20:32:35.3304176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3304268Z op = silu_mul_quant 2025-05-07T20:32:35.3304351Z if compiled: 2025-05-07T20:32:35.3304449Z op = torch.compile(op) 2025-05-07T20:32:35.3304560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3304634Z 2025-05-07T20:32:35.3304723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3304731Z 2025-05-07T20:32:35.3304833Z moe/activation_test.py:117: 2025-05-07T20:32:35.3304960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3305062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3305162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3305523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3305621Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3306110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3306207Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3306563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3306782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3307121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3307216Z kernel = self.compile( 2025-05-07T20:32:35.3307590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3307768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3307895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3307899Z 2025-05-07T20:32:35.3308109Z self = 2025-05-07T20:32:35.3308873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3309369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201d1c0>} 2025-05-07T20:32:35.3310108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3310295Z context = 2025-05-07T20:32:35.3310346Z 2025-05-07T20:32:35.3310517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3310845Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3310953Z module_map=module_map) 2025-05-07T20:32:35.3311119Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3311220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3311298Z E ^ 2025-05-07T20:32:35.3311653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3311695Z 2025-05-07T20:32:35.3312102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3312106Z 2025-05-07T20:32:35.3312213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3312431Z self=, 2025-05-07T20:32:35.3312512Z T=128, 2025-05-07T20:32:35.3312596Z D=7168, 2025-05-07T20:32:35.3312685Z scale_ub=1200.0, 2025-05-07T20:32:35.3312770Z contiguous=False, 2025-05-07T20:32:35.3312861Z compiled=False, 2025-05-07T20:32:35.3312935Z ) 2025-05-07T20:32:35.3313156Z self = 2025-05-07T20:32:35.3313326Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3313330Z 2025-05-07T20:32:35.3313404Z @given( 2025-05-07T20:32:35.3313536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3313636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3313750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3313874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3313991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3314073Z ) 2025-05-07T20:32:35.3314317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3314415Z def test_silu_mul_quant( 2025-05-07T20:32:35.3314500Z self, 2025-05-07T20:32:35.3314579Z T: int, 2025-05-07T20:32:35.3314657Z D: int, 2025-05-07T20:32:35.3314760Z scale_ub: Optional[float], 2025-05-07T20:32:35.3314849Z contiguous: bool, 2025-05-07T20:32:35.3314934Z compiled: bool, 2025-05-07T20:32:35.3315015Z ) -> None: 2025-05-07T20:32:35.3315108Z torch.manual_seed(2025) 2025-05-07T20:32:35.3315184Z 2025-05-07T20:32:35.3315358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3315431Z 2025-05-07T20:32:35.3315521Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3315666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3315766Z x = x_sign * x_clamp 2025-05-07T20:32:35.3315863Z x0 = x[:, :D] 2025-05-07T20:32:35.3315958Z x1 = x[:, D:] 2025-05-07T20:32:35.3316029Z 2025-05-07T20:32:35.3316116Z if contiguous: 2025-05-07T20:32:35.3316212Z x0 = x0.contiguous() 2025-05-07T20:32:35.3316300Z x1 = x1.contiguous() 2025-05-07T20:32:35.3316376Z 2025-05-07T20:32:35.3316466Z if scale_ub is not None: 2025-05-07T20:32:35.3316569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3316709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3316783Z ) 2025-05-07T20:32:35.3316860Z else: 2025-05-07T20:32:35.3316961Z scale_ub_tensor = None 2025-05-07T20:32:35.3317034Z 2025-05-07T20:32:35.3317168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3317256Z op = silu_mul_quant 2025-05-07T20:32:35.3317339Z if compiled: 2025-05-07T20:32:35.3317445Z op = torch.compile(op) 2025-05-07T20:32:35.3317548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3317670Z 2025-05-07T20:32:35.3317768Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3317773Z 2025-05-07T20:32:35.3317944Z moe/activation_test.py:117: 2025-05-07T20:32:35.3318074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3318177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3318272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3318770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3318929Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3319279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3319504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3319836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3319931Z kernel = self.compile( 2025-05-07T20:32:35.3320316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3320489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3320619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3320623Z 2025-05-07T20:32:35.3320825Z self = 2025-05-07T20:32:35.3321583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3322086Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201cd60>} 2025-05-07T20:32:35.3322823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3323014Z context = 2025-05-07T20:32:35.3323019Z 2025-05-07T20:32:35.3323181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3323446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3323554Z module_map=module_map) 2025-05-07T20:32:35.3323715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3323819Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3323897Z E ^ 2025-05-07T20:32:35.3328819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3328836Z 2025-05-07T20:32:35.3329439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3329445Z 2025-05-07T20:32:35.3329556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3329780Z self=, 2025-05-07T20:32:35.3329857Z T=128, 2025-05-07T20:32:35.3329940Z D=5120, 2025-05-07T20:32:35.3330021Z scale_ub=None, 2025-05-07T20:32:35.3330111Z contiguous=False, 2025-05-07T20:32:35.3330197Z compiled=False, 2025-05-07T20:32:35.3330268Z ) 2025-05-07T20:32:35.3330492Z self = 2025-05-07T20:32:35.3330661Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3330666Z 2025-05-07T20:32:35.3330742Z @given( 2025-05-07T20:32:35.3330866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3331049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3331163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3331359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3331473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3331548Z ) 2025-05-07T20:32:35.3331799Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3331893Z def test_silu_mul_quant( 2025-05-07T20:32:35.3331977Z self, 2025-05-07T20:32:35.3332055Z T: int, 2025-05-07T20:32:35.3332206Z D: int, 2025-05-07T20:32:35.3332308Z scale_ub: Optional[float], 2025-05-07T20:32:35.3332398Z contiguous: bool, 2025-05-07T20:32:35.3332492Z compiled: bool, 2025-05-07T20:32:35.3332574Z ) -> None: 2025-05-07T20:32:35.3332667Z torch.manual_seed(2025) 2025-05-07T20:32:35.3332736Z 2025-05-07T20:32:35.3332906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3332983Z 2025-05-07T20:32:35.3333151Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3333284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3333369Z x = x_sign * x_clamp 2025-05-07T20:32:35.3333452Z x0 = x[:, :D] 2025-05-07T20:32:35.3333530Z x1 = x[:, D:] 2025-05-07T20:32:35.3333602Z 2025-05-07T20:32:35.3333689Z if contiguous: 2025-05-07T20:32:35.3333780Z x0 = x0.contiguous() 2025-05-07T20:32:35.3333866Z x1 = x1.contiguous() 2025-05-07T20:32:35.3333944Z 2025-05-07T20:32:35.3334036Z if scale_ub is not None: 2025-05-07T20:32:35.3334140Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3334275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3334350Z ) 2025-05-07T20:32:35.3334423Z else: 2025-05-07T20:32:35.3334520Z scale_ub_tensor = None 2025-05-07T20:32:35.3334595Z 2025-05-07T20:32:35.3334726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3334814Z op = silu_mul_quant 2025-05-07T20:32:35.3334901Z if compiled: 2025-05-07T20:32:35.3335008Z op = torch.compile(op) 2025-05-07T20:32:35.3335110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3335182Z 2025-05-07T20:32:35.3335275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3335279Z 2025-05-07T20:32:35.3335373Z moe/activation_test.py:117: 2025-05-07T20:32:35.3335498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3335604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3335701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3336195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3336289Z 
    _fbgemm_silu_mul_quant[grid](
    ... (same Triton compile path as in the full traceback below) ...
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
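Every failing example bottoms out in the same place: Triton's NVIDIA backend refuses to lower the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on this GPU. This job ran on a g5.4xlarge runner, whose A10G GPU reports compute capability (8, 6); Triton only lowers fp8e4nv on compute capability 8.9 and newer (Ada/Hopper), which is why the error offers only 'fp8e4b15' and 'fp8e5' as alternatives. A minimal capability probe, as a sketch (the helper name is illustrative, not something from the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowers in Triton's NVIDIA backend only on
        # SM 8.9+ (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6) and fails.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A guard along these lines at the test or class level (for example unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")) would turn these hard failures into skips on pre-Ada runners.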
The next examples repeat the identical source listing and traceback; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> same CompilationError from _fbgemm_silu_mul_quant; the compiled=True path enters through
   torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> same CompilationError from _fbgemm_silu_mul_quant
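The compiled=True draws add only the torch/_dynamo/eval_frame.py:678 frame before hitting the identical failure, because torch.compile is lazy: wrapping silu_mul_quant succeeds, and the Triton build error only surfaces on the first call. A small sketch of that behavior, with a stand-in op:

    import torch

    def op(x: torch.Tensor) -> torch.Tensor:
        # Stand-in for silu_mul_quant: any function that lowers to a GPU kernel.
        return x * torch.sigmoid(x)

    compiled_op = torch.compile(op)                  # returns immediately; nothing is compiled yet
    y = compiled_op(torch.randn(8, device="cuda"))   # first call triggers compilation, so backend
                                                     # errors surface here, inside eval_frame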
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
-> this draw gets past fn() and fails in the test's reference path instead (same @given listing as above):

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
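The reference path makes the intended semantics explicit: y = silu(x0) * x1 computed in fp32, then row-wise FP8 quantization returning (y_fp8, y_scale) such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that contract follows; the function name, the E4M3 max of 448.0, and the 1e-12 epsilon are assumptions for illustration, and triton_quantize_fp8_row's exact clamping and scale_ub handling may differ:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = silu(x0) * x1, in fp32, like ref_fn in the test.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One fp32 scale per row, chosen so y / scale fits the E4M3 range.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = 448.0  # torch.finfo(torch.float8_e4m3fn).max
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale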
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3395998Z 2025-05-07T20:32:35.3396405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3396410Z 2025-05-07T20:32:35.3396510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3396730Z self=, 2025-05-07T20:32:35.3396808Z T=1, 2025-05-07T20:32:35.3396884Z D=5120, 2025-05-07T20:32:35.3396968Z scale_ub=1200.0, 2025-05-07T20:32:35.3397052Z contiguous=False, 2025-05-07T20:32:35.3397138Z compiled=True, 2025-05-07T20:32:35.3397212Z ) 2025-05-07T20:32:35.3397423Z self = 2025-05-07T20:32:35.3397584Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3397589Z 2025-05-07T20:32:35.3397670Z @given( 2025-05-07T20:32:35.3397785Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3397891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3398004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3398120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3398234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3398306Z ) 2025-05-07T20:32:35.3398548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3398643Z def test_silu_mul_quant( 2025-05-07T20:32:35.3398714Z self, 2025-05-07T20:32:35.3398788Z T: int, 2025-05-07T20:32:35.3398868Z D: int, 2025-05-07T20:32:35.3398965Z scale_ub: Optional[float], 2025-05-07T20:32:35.3399056Z contiguous: bool, 2025-05-07T20:32:35.3399140Z compiled: bool, 2025-05-07T20:32:35.3399214Z ) -> None: 2025-05-07T20:32:35.3399307Z torch.manual_seed(2025) 2025-05-07T20:32:35.3399378Z 2025-05-07T20:32:35.3399543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3399623Z 2025-05-07T20:32:35.3399713Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3399836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3399927Z x = x_sign * x_clamp 2025-05-07T20:32:35.3400007Z x0 = x[:, :D] 2025-05-07T20:32:35.3400084Z x1 = x[:, D:] 2025-05-07T20:32:35.3400161Z 2025-05-07T20:32:35.3400246Z if contiguous: 2025-05-07T20:32:35.3400332Z x0 = x0.contiguous() 2025-05-07T20:32:35.3400423Z x1 = x1.contiguous() 2025-05-07T20:32:35.3400496Z 2025-05-07T20:32:35.3400585Z if scale_ub is not None: 2025-05-07T20:32:35.3400686Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3400817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3400901Z ) 2025-05-07T20:32:35.3400975Z else: 2025-05-07T20:32:35.3401068Z scale_ub_tensor = None 2025-05-07T20:32:35.3401146Z 2025-05-07T20:32:35.3401275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3401361Z op = silu_mul_quant 2025-05-07T20:32:35.3401448Z if compiled: 2025-05-07T20:32:35.3401544Z op = torch.compile(op) 2025-05-07T20:32:35.3401645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3401720Z 2025-05-07T20:32:35.3401806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3401858Z 2025-05-07T20:32:35.3401957Z moe/activation_test.py:117: 2025-05-07T20:32:35.3402154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3402253Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3402355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3402713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3402803Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3403288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3403424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3403776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3403995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3404328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3404426Z kernel = self.compile( 2025-05-07T20:32:35.3404797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3404966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3405095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3405100Z 2025-05-07T20:32:35.3405304Z self = 2025-05-07T20:32:35.3406066Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3406556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c68540>} 2025-05-07T20:32:35.3407292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3407476Z context = 2025-05-07T20:32:35.3407481Z 2025-05-07T20:32:35.3407643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3407902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3408004Z module_map=module_map) 2025-05-07T20:32:35.3408165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3408262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3408334Z E ^ 2025-05-07T20:32:35.3408687Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3408691Z 2025-05-07T20:32:35.3409097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3409102Z 2025-05-07T20:32:35.3409201Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3409421Z self=, 2025-05-07T20:32:35.3409497Z T=1, 2025-05-07T20:32:35.3409574Z D=5120, 2025-05-07T20:32:35.3409658Z scale_ub=1200.0, 2025-05-07T20:32:35.3409741Z contiguous=False, 2025-05-07T20:32:35.3409825Z compiled=False, 2025-05-07T20:32:35.3409896Z ) 2025-05-07T20:32:35.3410108Z self = 2025-05-07T20:32:35.3410275Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3410280Z 2025-05-07T20:32:35.3410400Z @given( 2025-05-07T20:32:35.3410515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3410688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3410802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3410919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3411029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3411102Z ) 2025-05-07T20:32:35.3411345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3411436Z def test_silu_mul_quant( 2025-05-07T20:32:35.3411572Z self, 2025-05-07T20:32:35.3411651Z T: int, 2025-05-07T20:32:35.3411723Z D: int, 2025-05-07T20:32:35.3411820Z scale_ub: Optional[float], 2025-05-07T20:32:35.3411910Z contiguous: bool, 2025-05-07T20:32:35.3411993Z compiled: bool, 2025-05-07T20:32:35.3412065Z ) -> None: 2025-05-07T20:32:35.3412160Z torch.manual_seed(2025) 2025-05-07T20:32:35.3412235Z 2025-05-07T20:32:35.3412403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3412481Z 2025-05-07T20:32:35.3412569Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3412693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3412779Z x = x_sign * x_clamp 2025-05-07T20:32:35.3412853Z x0 = x[:, :D] 2025-05-07T20:32:35.3412934Z x1 = x[:, D:] 2025-05-07T20:32:35.3413069Z 2025-05-07T20:32:35.3413150Z if contiguous: 2025-05-07T20:32:35.3413241Z x0 = x0.contiguous() 2025-05-07T20:32:35.3413332Z x1 = x1.contiguous() 2025-05-07T20:32:35.3413403Z 2025-05-07T20:32:35.3413494Z if scale_ub is not None: 2025-05-07T20:32:35.3413597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3413730Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3413802Z ) 2025-05-07T20:32:35.3413875Z else: 2025-05-07T20:32:35.3413973Z scale_ub_tensor = None 2025-05-07T20:32:35.3414043Z 2025-05-07T20:32:35.3414173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3414262Z op = silu_mul_quant 2025-05-07T20:32:35.3414345Z if compiled: 2025-05-07T20:32:35.3414443Z op = torch.compile(op) 2025-05-07T20:32:35.3414549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3414618Z 2025-05-07T20:32:35.3414705Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3414712Z 2025-05-07T20:32:35.3414809Z moe/activation_test.py:117: 2025-05-07T20:32:35.3414934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3415034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3415131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3415621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3415724Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3416079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3416295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3416629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3416718Z kernel = self.compile( 2025-05-07T20:32:35.3417094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3417266Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3417387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3417392Z 2025-05-07T20:32:35.3417594Z self = 2025-05-07T20:32:35.3418469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3418967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017dbb560>} 2025-05-07T20:32:35.3419695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3419922Z context = 2025-05-07T20:32:35.3419927Z 2025-05-07T20:32:35.3420087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3420342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3420450Z module_map=module_map) 2025-05-07T20:32:35.3420610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3420705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3420782Z E ^ 2025-05-07T20:32:35.3421126Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3421131Z 2025-05-07T20:32:35.3421536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3421543Z 2025-05-07T20:32:35.3421641Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3421857Z self=, 2025-05-07T20:32:35.3421935Z T=16384, 2025-05-07T20:32:35.3422007Z D=5120, 2025-05-07T20:32:35.3422088Z scale_ub=1200.0, 2025-05-07T20:32:35.3422179Z contiguous=False, 2025-05-07T20:32:35.3422259Z compiled=True, 2025-05-07T20:32:35.3422334Z ) 2025-05-07T20:32:35.3422555Z self = 2025-05-07T20:32:35.3422729Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3422733Z 2025-05-07T20:32:35.3422810Z @given( 2025-05-07T20:32:35.3422926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3423021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3423134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3423250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3423359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3423435Z ) 2025-05-07T20:32:35.3423674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3423766Z def test_silu_mul_quant( 2025-05-07T20:32:35.3423838Z self, 2025-05-07T20:32:35.3423913Z T: int, 2025-05-07T20:32:35.3423991Z D: int, 2025-05-07T20:32:35.3424087Z scale_ub: Optional[float], 2025-05-07T20:32:35.3424179Z contiguous: bool, 2025-05-07T20:32:35.3424264Z compiled: bool, 2025-05-07T20:32:35.3424339Z ) -> None: 2025-05-07T20:32:35.3424431Z torch.manual_seed(2025) 2025-05-07T20:32:35.3424501Z 2025-05-07T20:32:35.3424664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3424738Z 2025-05-07T20:32:35.3424828Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3424955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3425042Z x = x_sign * x_clamp 2025-05-07T20:32:35.3425118Z x0 = x[:, :D] 2025-05-07T20:32:35.3425194Z x1 = x[:, D:] 2025-05-07T20:32:35.3425270Z 2025-05-07T20:32:35.3425349Z if contiguous: 2025-05-07T20:32:35.3425435Z x0 = x0.contiguous() 2025-05-07T20:32:35.3425524Z x1 = x1.contiguous() 2025-05-07T20:32:35.3425640Z 2025-05-07T20:32:35.3425727Z if scale_ub is not None: 2025-05-07T20:32:35.3425904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3426035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3426111Z ) 2025-05-07T20:32:35.3426188Z else: 2025-05-07T20:32:35.3426279Z scale_ub_tensor = None 2025-05-07T20:32:35.3426346Z 2025-05-07T20:32:35.3426475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3426562Z op = silu_mul_quant 2025-05-07T20:32:35.3426692Z if compiled: 2025-05-07T20:32:35.3426788Z op = torch.compile(op) 2025-05-07T20:32:35.3426889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3426958Z 2025-05-07T20:32:35.3427045Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3427050Z 2025-05-07T20:32:35.3427142Z moe/activation_test.py:117: 2025-05-07T20:32:35.3427272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3427369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3427469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3427832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3427920Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3428403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3428501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3428849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3429068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3429400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3429503Z kernel = self.compile( 2025-05-07T20:32:35.3429882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3430054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3430186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3430191Z 2025-05-07T20:32:35.3430394Z self = 2025-05-07T20:32:35.3431159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3431657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1024c85e0>} 2025-05-07T20:32:35.3432392Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3432587Z context = 2025-05-07T20:32:35.3432591Z 2025-05-07T20:32:35.3432754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3433016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3433126Z module_map=module_map) 2025-05-07T20:32:35.3433287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3433391Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3433469Z E ^ 2025-05-07T20:32:35.3433817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3433877Z 2025-05-07T20:32:35.3434897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3434904Z 2025-05-07T20:32:35.3435008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3435236Z self=, 2025-05-07T20:32:35.3435316Z T=2048, 2025-05-07T20:32:35.3435393Z D=7168, 2025-05-07T20:32:35.3435482Z scale_ub=1200.0, 2025-05-07T20:32:35.3435568Z contiguous=False, 2025-05-07T20:32:35.3435650Z compiled=True, 2025-05-07T20:32:35.3435769Z ) 2025-05-07T20:32:35.3435986Z self = 2025-05-07T20:32:35.3436165Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3436170Z 2025-05-07T20:32:35.3436245Z @given( 2025-05-07T20:32:35.3436363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3436472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3436585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3436706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3436824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3436900Z ) 2025-05-07T20:32:35.3437145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3437250Z def test_silu_mul_quant( 2025-05-07T20:32:35.3437326Z self, 2025-05-07T20:32:35.3437409Z T: int, 2025-05-07T20:32:35.3437486Z D: int, 2025-05-07T20:32:35.3437583Z scale_ub: Optional[float], 2025-05-07T20:32:35.3437680Z contiguous: bool, 2025-05-07T20:32:35.3437765Z compiled: bool, 2025-05-07T20:32:35.3437842Z ) -> None: 2025-05-07T20:32:35.3437940Z torch.manual_seed(2025) 2025-05-07T20:32:35.3438014Z 2025-05-07T20:32:35.3438179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3438267Z 2025-05-07T20:32:35.3438357Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3438485Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3438577Z x = x_sign * x_clamp 2025-05-07T20:32:35.3438656Z x0 = x[:, :D] 2025-05-07T20:32:35.3438740Z x1 = x[:, D:] 2025-05-07T20:32:35.3438811Z 2025-05-07T20:32:35.3438891Z if contiguous: 2025-05-07T20:32:35.3438988Z x0 = x0.contiguous() 2025-05-07T20:32:35.3439077Z x1 = x1.contiguous() 2025-05-07T20:32:35.3439151Z 2025-05-07T20:32:35.3439251Z if scale_ub is not None: 2025-05-07T20:32:35.3439357Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3439489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3439571Z ) 2025-05-07T20:32:35.3439648Z else: 2025-05-07T20:32:35.3439740Z scale_ub_tensor = None 2025-05-07T20:32:35.3439821Z 2025-05-07T20:32:35.3439951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3440039Z op = silu_mul_quant 2025-05-07T20:32:35.3440134Z if compiled: 2025-05-07T20:32:35.3440232Z op = torch.compile(op) 2025-05-07T20:32:35.3440343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3440415Z 2025-05-07T20:32:35.3440505Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3440510Z 2025-05-07T20:32:35.3440611Z moe/activation_test.py:117: 2025-05-07T20:32:35.3440737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3440840Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3440944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3441306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3441405Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3441890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3442033Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3442487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3442709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3443042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3443140Z kernel = self.compile( 2025-05-07T20:32:35.3443553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3443728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3443854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3443859Z 2025-05-07T20:32:35.3444062Z self = 2025-05-07T20:32:35.3444836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3445332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10313a8e0>} 2025-05-07T20:32:35.3446116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3446307Z context = 2025-05-07T20:32:35.3446311Z 2025-05-07T20:32:35.3446478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3446738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3446848Z module_map=module_map) 2025-05-07T20:32:35.3447014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3450870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3450961Z E ^ 2025-05-07T20:32:35.3451324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3451329Z 2025-05-07T20:32:35.3451744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3451749Z 2025-05-07T20:32:35.3451854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3452074Z self=, 2025-05-07T20:32:35.3452151Z T=1, 2025-05-07T20:32:35.3452229Z D=5120, 2025-05-07T20:32:35.3452312Z scale_ub=None, 2025-05-07T20:32:35.3452397Z contiguous=False, 2025-05-07T20:32:35.3452482Z compiled=False, 2025-05-07T20:32:35.3452552Z ) 2025-05-07T20:32:35.3452771Z self = 2025-05-07T20:32:35.3452937Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3452942Z 2025-05-07T20:32:35.3453073Z @given( 2025-05-07T20:32:35.3453193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3453291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3453406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3453524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3453636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3453709Z ) 2025-05-07T20:32:35.3453953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3454043Z def test_silu_mul_quant( 2025-05-07T20:32:35.3454187Z self, 2025-05-07T20:32:35.3454264Z T: int, 2025-05-07T20:32:35.3454337Z D: int, 2025-05-07T20:32:35.3454506Z scale_ub: Optional[float], 2025-05-07T20:32:35.3454601Z contiguous: bool, 2025-05-07T20:32:35.3454684Z compiled: bool, 2025-05-07T20:32:35.3454764Z ) -> None: 2025-05-07T20:32:35.3454856Z torch.manual_seed(2025) 2025-05-07T20:32:35.3454926Z 2025-05-07T20:32:35.3455093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3455163Z 2025-05-07T20:32:35.3455291Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3455417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3455503Z x = x_sign * x_clamp 2025-05-07T20:32:35.3455581Z x0 = x[:, :D] 2025-05-07T20:32:35.3455661Z x1 = x[:, D:] 2025-05-07T20:32:35.3455729Z 2025-05-07T20:32:35.3455811Z if contiguous: 2025-05-07T20:32:35.3455902Z x0 = x0.contiguous() 2025-05-07T20:32:35.3455991Z x1 = x1.contiguous() 2025-05-07T20:32:35.3456062Z 2025-05-07T20:32:35.3456155Z if scale_ub is not None: 2025-05-07T20:32:35.3456257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3456391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3456463Z ) 2025-05-07T20:32:35.3456538Z else: 2025-05-07T20:32:35.3456633Z scale_ub_tensor = None 2025-05-07T20:32:35.3456704Z 2025-05-07T20:32:35.3456830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3456924Z op = silu_mul_quant 2025-05-07T20:32:35.3457005Z if compiled: 2025-05-07T20:32:35.3457102Z op = torch.compile(op) 2025-05-07T20:32:35.3457208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3457277Z 2025-05-07T20:32:35.3457369Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3457373Z 2025-05-07T20:32:35.3457471Z moe/activation_test.py:117: 2025-05-07T20:32:35.3457595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3457702Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3457798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3458292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3458390Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3458739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3458962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3459524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3459646Z kernel = self.compile( 2025-05-07T20:32:35.3460025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3460202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3460325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3460333Z 2025-05-07T20:32:35.3460534Z self = 2025-05-07T20:32:35.3461291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3461790Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10333c7c0>} 2025-05-07T20:32:35.3462518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3462925Z context = 2025-05-07T20:32:35.3462931Z 2025-05-07T20:32:35.3463093Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3463346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3463452Z module_map=module_map) 2025-05-07T20:32:35.3463610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3463766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3463844Z E ^ 2025-05-07T20:32:35.3464192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3464196Z 2025-05-07T20:32:35.3464601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3464608Z 2025-05-07T20:32:35.3464707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3464929Z self=, 2025-05-07T20:32:35.3465009Z T=4096, 2025-05-07T20:32:35.3465083Z D=7168, 2025-05-07T20:32:35.3465164Z scale_ub=1200.0, 2025-05-07T20:32:35.3465246Z contiguous=False, 2025-05-07T20:32:35.3465324Z compiled=False, 2025-05-07T20:32:35.3465400Z ) 2025-05-07T20:32:35.3465611Z self = 2025-05-07T20:32:35.3465786Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3465791Z 2025-05-07T20:32:35.3465871Z @given( 2025-05-07T20:32:35.3465987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3466085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3466199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3466316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3466431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3466504Z ) 2025-05-07T20:32:35.3466744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3466836Z def test_silu_mul_quant( 2025-05-07T20:32:35.3466906Z self, 2025-05-07T20:32:35.3466980Z T: int, 2025-05-07T20:32:35.3467055Z D: int, 2025-05-07T20:32:35.3467152Z scale_ub: Optional[float], 2025-05-07T20:32:35.3467236Z contiguous: bool, 2025-05-07T20:32:35.3467326Z compiled: bool, 2025-05-07T20:32:35.3467403Z ) -> None: 2025-05-07T20:32:35.3467494Z torch.manual_seed(2025) 2025-05-07T20:32:35.3467569Z 2025-05-07T20:32:35.3467734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3467807Z 2025-05-07T20:32:35.3467897Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3468021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3468109Z x = x_sign * x_clamp 2025-05-07T20:32:35.3468190Z x0 = x[:, :D] 2025-05-07T20:32:35.3468267Z x1 = x[:, D:] 2025-05-07T20:32:35.3468338Z 2025-05-07T20:32:35.3468422Z if contiguous: 2025-05-07T20:32:35.3468509Z x0 = x0.contiguous() 2025-05-07T20:32:35.3468602Z x1 = x1.contiguous() 2025-05-07T20:32:35.3468674Z 2025-05-07T20:32:35.3468764Z if scale_ub is not None: 2025-05-07T20:32:35.3468868Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3469000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3469070Z ) 2025-05-07T20:32:35.3469147Z else: 2025-05-07T20:32:35.3469238Z scale_ub_tensor = None 2025-05-07T20:32:35.3469311Z 2025-05-07T20:32:35.3469436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3469523Z op = silu_mul_quant 2025-05-07T20:32:35.3469656Z if compiled: 2025-05-07T20:32:35.3469752Z op = torch.compile(op) 2025-05-07T20:32:35.3469922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3470001Z 2025-05-07T20:32:35.3470088Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3470093Z 2025-05-07T20:32:35.3470191Z moe/activation_test.py:117: 2025-05-07T20:32:35.3470318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3470417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3470517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3471046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.3471142Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.3471494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.3471713Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.3472047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.3472142Z     kernel = self.compile(
2025-05-07T20:32:35.3472513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.3472687Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.3472809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3472816Z 
2025-05-07T20:32:35.3473017Z self = 
2025-05-07T20:32:35.3473782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.3474278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1085b5620>}
2025-05-07T20:32:35.3475008Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:35.3475195Z context = 
2025-05-07T20:32:35.3475199Z 
2025-05-07T20:32:35.3475362Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.3475616Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.3475718Z                            module_map=module_map)
2025-05-07T20:32:35.3475879Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3475976Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3476052Z E   ^
2025-05-07T20:32:35.3476405Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.3476410Z 
2025-05-07T20:32:35.3476812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.3476817Z 
2025-05-07T20:32:35.3476918Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.3477135Z     self=,
2025-05-07T20:32:35.3477213Z     T=16384,
2025-05-07T20:32:35.3477290Z     D=7168,
2025-05-07T20:32:35.3477368Z     scale_ub=None,
2025-05-07T20:32:35.3477449Z     contiguous=True,
2025-05-07T20:32:35.3477531Z     compiled=True,
2025-05-07T20:32:35.3477599Z )
2025-05-07T20:32:35.3477811Z self = 
2025-05-07T20:32:35.3477979Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:35.3478028Z 
2025-05-07T20:32:35.3478104Z     @given(
2025-05-07T20:32:35.3478317Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.3478415Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.3478526Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.3478641Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.3478751Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.3478823Z     )
2025-05-07T20:32:35.3479066Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.3479197Z     def test_silu_mul_quant(
2025-05-07T20:32:35.3479275Z         self,
2025-05-07T20:32:35.3479351Z         T: int,
2025-05-07T20:32:35.3479427Z         D: int,
2025-05-07T20:32:35.3479527Z         scale_ub: Optional[float],
2025-05-07T20:32:35.3479613Z         contiguous: bool,
2025-05-07T20:32:35.3479697Z         compiled: bool,
2025-05-07T20:32:35.3479779Z     ) -> None:
2025-05-07T20:32:35.3479873Z         torch.manual_seed(2025)
2025-05-07T20:32:35.3479944Z 
2025-05-07T20:32:35.3480115Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.3480188Z 
2025-05-07T20:32:35.3480279Z         x_sign = torch.sign(x)
2025-05-07T20:32:35.3480405Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.3480490Z         x = x_sign * x_clamp
2025-05-07T20:32:35.3480566Z         x0 = x[:, :D]
2025-05-07T20:32:35.3480647Z         x1 = x[:, D:]
2025-05-07T20:32:35.3480718Z 
2025-05-07T20:32:35.3480802Z         if contiguous:
2025-05-07T20:32:35.3480892Z             x0 = x0.contiguous()
2025-05-07T20:32:35.3480978Z             x1 = x1.contiguous()
2025-05-07T20:32:35.3481049Z 
2025-05-07T20:32:35.3481136Z         if scale_ub is not None:
2025-05-07T20:32:35.3481237Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.3481369Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.3481444Z             )
2025-05-07T20:32:35.3481520Z         else:
2025-05-07T20:32:35.3481615Z             scale_ub_tensor = None
2025-05-07T20:32:35.3481693Z 
2025-05-07T20:32:35.3481818Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.3481907Z             op = silu_mul_quant
2025-05-07T20:32:35.3481992Z             if compiled:
2025-05-07T20:32:35.3482091Z                 op = torch.compile(op)
2025-05-07T20:32:35.3482195Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.3482265Z 
2025-05-07T20:32:35.3482363Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.3482367Z 
2025-05-07T20:32:35.3482462Z moe/activation_test.py:117: 
2025-05-07T20:32:35.3482592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3482695Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.3482790Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.3483149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:35.3483244Z     return fn(*args, **kwargs)
2025-05-07T20:32:35.3483731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.3483828Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.3484177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.3484395Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.3484730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.3484821Z     kernel = self.compile(
2025-05-07T20:32:35.3485192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.3485367Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.3485540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3485616Z 
2025-05-07T20:32:35.3485853Z self = 
2025-05-07T20:32:35.3486625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.3487119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852c900>}
2025-05-07T20:32:35.3487884Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:35.3488074Z context = 
2025-05-07T20:32:35.3488079Z 
2025-05-07T20:32:35.3488248Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.3488503Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.3488609Z                            module_map=module_map)
2025-05-07T20:32:35.3488763Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3488859Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3488939Z E   ^
2025-05-07T20:32:35.3489288Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.3489293Z 
2025-05-07T20:32:35.3489694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.3489702Z 
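Every example fails before the kernel ever runs: Triton rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant. The linux.g5 runner's NVIDIA A10G reports compute capability (8, 6), and Triton lowers fp8e4nv only on SM 8.9+ parts (Ada/Hopper); on older architectures it offers just fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal gating sketch, assuming a unittest-style test class and that a device-capability check is an acceptable skip condition; the helper name and the (8, 9) threshold are illustrative, not FBGEMM's actual API:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (e.g. L4, H100);
    # the A10G on this runner reports (8, 6) and trips the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+; this GPU does not support it")
class ActivationFp8Tests(unittest.TestCase):
    ...  # fp8-quantizing tests such as test_silu_mul_quant would live here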
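For triage, the failure should also reproduce without Hypothesis or torch.compile, since the compiled=False examples fail the same way. A minimal repro sketch, using only the import path and call signature visible in the traceback (the shapes are one of the sampled combinations):

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Build a [T, 2*D] bf16 input and split it, as the test does.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# On a pre-SM-8.9 GPU this raises the same triton CompilationError
# ("type fp8e4nv not supported in this architecture") at compile time.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)  # scale_ub=None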
[Hypothesis then retried the remaining sampled parameter combinations; every one failed at the same kernel-compile step with the identical CompilationError. The repeated test-source listings and tracebacks are elided; the attempted examples and the final error tail follow.]

2025-05-07T20:32:35.3489800Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3502669Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3514982Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.3527887Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3540151Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:35.3552485Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.3565663Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.3581689Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3594492Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3607175Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3619629Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.3631374Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3631472Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3631548Z E   ^
2025-05-07T20:32:35.3631892Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3631935Z 2025-05-07T20:32:35.3632341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3632346Z 2025-05-07T20:32:35.3632446Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3632667Z self=, 2025-05-07T20:32:35.3632740Z T=16384, 2025-05-07T20:32:35.3632814Z D=5120, 2025-05-07T20:32:35.3632895Z scale_ub=None, 2025-05-07T20:32:35.3632983Z contiguous=False, 2025-05-07T20:32:35.3633063Z compiled=True, 2025-05-07T20:32:35.3633133Z ) 2025-05-07T20:32:35.3633346Z self = 2025-05-07T20:32:35.3633516Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3633525Z 2025-05-07T20:32:35.3633600Z @given( 2025-05-07T20:32:35.3633715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3633817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3633928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3634044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3634156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3634224Z ) 2025-05-07T20:32:35.3634465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3634560Z def test_silu_mul_quant( 2025-05-07T20:32:35.3634634Z self, 2025-05-07T20:32:35.3634711Z T: int, 2025-05-07T20:32:35.3634788Z D: int, 2025-05-07T20:32:35.3634883Z scale_ub: Optional[float], 2025-05-07T20:32:35.3634973Z contiguous: bool, 2025-05-07T20:32:35.3635056Z compiled: bool, 2025-05-07T20:32:35.3635132Z ) -> None: 2025-05-07T20:32:35.3635225Z torch.manual_seed(2025) 2025-05-07T20:32:35.3635293Z 2025-05-07T20:32:35.3635457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3635536Z 2025-05-07T20:32:35.3635625Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3635745Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3635832Z x = x_sign * x_clamp 2025-05-07T20:32:35.3635910Z x0 = x[:, :D] 2025-05-07T20:32:35.3635985Z x1 = x[:, D:] 2025-05-07T20:32:35.3636060Z 2025-05-07T20:32:35.3636141Z if contiguous: 2025-05-07T20:32:35.3636233Z x0 = x0.contiguous() 2025-05-07T20:32:35.3636325Z x1 = x1.contiguous() 2025-05-07T20:32:35.3636392Z 2025-05-07T20:32:35.3636480Z if scale_ub is not None: 2025-05-07T20:32:35.3636581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3636710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3636785Z ) 2025-05-07T20:32:35.3636859Z else: 2025-05-07T20:32:35.3636948Z scale_ub_tensor = None 2025-05-07T20:32:35.3637021Z 2025-05-07T20:32:35.3637146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3637231Z op = silu_mul_quant 2025-05-07T20:32:35.3637317Z if compiled: 2025-05-07T20:32:35.3637412Z op = torch.compile(op) 2025-05-07T20:32:35.3637515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3637583Z 2025-05-07T20:32:35.3637721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3637726Z 2025-05-07T20:32:35.3637824Z moe/activation_test.py:117: 2025-05-07T20:32:35.3638024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3638123Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3638222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3638583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3638672Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3639196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3639290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3639640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3639859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3640197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3640294Z kernel = self.compile( 2025-05-07T20:32:35.3640667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3640837Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3640962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3640970Z 2025-05-07T20:32:35.3641170Z self = 2025-05-07T20:32:35.3641929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3642430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c68b80>} 2025-05-07T20:32:35.3643161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3643347Z context = 2025-05-07T20:32:35.3643351Z 2025-05-07T20:32:35.3643510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3643770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3643874Z module_map=module_map) 2025-05-07T20:32:35.3644034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3644129Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3644204Z E ^ 2025-05-07T20:32:35.3644562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3644566Z 2025-05-07T20:32:35.3644969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3644973Z 2025-05-07T20:32:35.3645076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3645294Z self=, 2025-05-07T20:32:35.3645365Z T=2048, 2025-05-07T20:32:35.3645447Z D=5120, 2025-05-07T20:32:35.3645525Z scale_ub=None, 2025-05-07T20:32:35.3645612Z contiguous=False, 2025-05-07T20:32:35.3645693Z compiled=True, 2025-05-07T20:32:35.3645760Z ) 2025-05-07T20:32:35.3645972Z self = 2025-05-07T20:32:35.3646144Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3646197Z 2025-05-07T20:32:35.3646272Z @given( 2025-05-07T20:32:35.3646392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3646563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3646676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3646793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3646902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3646973Z ) 2025-05-07T20:32:35.3647214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3647343Z def test_silu_mul_quant( 2025-05-07T20:32:35.3647417Z self, 2025-05-07T20:32:35.3647493Z T: int, 2025-05-07T20:32:35.3647566Z D: int, 2025-05-07T20:32:35.3647663Z scale_ub: Optional[float], 2025-05-07T20:32:35.3647754Z contiguous: bool, 2025-05-07T20:32:35.3647836Z compiled: bool, 2025-05-07T20:32:35.3647914Z ) -> None: 2025-05-07T20:32:35.3648009Z torch.manual_seed(2025) 2025-05-07T20:32:35.3648082Z 2025-05-07T20:32:35.3648254Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3648327Z 2025-05-07T20:32:35.3648416Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3648541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3648628Z x = x_sign * x_clamp 2025-05-07T20:32:35.3648705Z x0 = x[:, :D] 2025-05-07T20:32:35.3648785Z x1 = x[:, D:] 2025-05-07T20:32:35.3648857Z 2025-05-07T20:32:35.3648940Z if contiguous: 2025-05-07T20:32:35.3649034Z x0 = x0.contiguous() 2025-05-07T20:32:35.3649119Z x1 = x1.contiguous() 2025-05-07T20:32:35.3649195Z 2025-05-07T20:32:35.3649282Z if scale_ub is not None: 2025-05-07T20:32:35.3649385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3649518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3649593Z ) 2025-05-07T20:32:35.3649666Z else: 2025-05-07T20:32:35.3649759Z scale_ub_tensor = None 2025-05-07T20:32:35.3649828Z 2025-05-07T20:32:35.3649957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3650048Z op = silu_mul_quant 2025-05-07T20:32:35.3650129Z if compiled: 2025-05-07T20:32:35.3650227Z op = torch.compile(op) 2025-05-07T20:32:35.3650331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3650401Z 2025-05-07T20:32:35.3650494Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3650502Z 2025-05-07T20:32:35.3650596Z moe/activation_test.py:117: 2025-05-07T20:32:35.3650721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3650821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3650915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3651274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3651368Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3651852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3651950Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3652299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3652516Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3652851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3652941Z kernel = self.compile( 2025-05-07T20:32:35.3653363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3653536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3653707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3653712Z 2025-05-07T20:32:35.3653988Z self = 2025-05-07T20:32:35.3654745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3655239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c6a0c0>} 2025-05-07T20:32:35.3656036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3656220Z context = 2025-05-07T20:32:35.3656227Z 2025-05-07T20:32:35.3656393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3656656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3656761Z module_map=module_map) 2025-05-07T20:32:35.3656917Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3657013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3657090Z E ^ 2025-05-07T20:32:35.3657437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3657445Z 2025-05-07T20:32:35.3657848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3657853Z 2025-05-07T20:32:35.3657959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3658179Z self=, 2025-05-07T20:32:35.3658257Z T=2048, 2025-05-07T20:32:35.3658333Z D=5120, 2025-05-07T20:32:35.3658418Z scale_ub=1200.0, 2025-05-07T20:32:35.3658506Z contiguous=False, 2025-05-07T20:32:35.3658590Z compiled=True, 2025-05-07T20:32:35.3658658Z ) 2025-05-07T20:32:35.3658873Z self = 2025-05-07T20:32:35.3659042Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3659046Z 2025-05-07T20:32:35.3659122Z @given( 2025-05-07T20:32:35.3659568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3659711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3659869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3659985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3660096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3660178Z ) 2025-05-07T20:32:35.3660419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3660517Z def test_silu_mul_quant( 2025-05-07T20:32:35.3660593Z self, 2025-05-07T20:32:35.3660669Z T: int, 2025-05-07T20:32:35.3660747Z D: int, 2025-05-07T20:32:35.3660846Z scale_ub: Optional[float], 2025-05-07T20:32:35.3660930Z contiguous: bool, 2025-05-07T20:32:35.3661010Z compiled: bool, 2025-05-07T20:32:35.3661090Z ) -> None: 2025-05-07T20:32:35.3661182Z torch.manual_seed(2025) 2025-05-07T20:32:35.3661259Z 2025-05-07T20:32:35.3661421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3661489Z 2025-05-07T20:32:35.3661582Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3661707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3661792Z x = x_sign * x_clamp 2025-05-07T20:32:35.3661870Z x0 = x[:, :D] 2025-05-07T20:32:35.3662040Z x1 = x[:, D:] 2025-05-07T20:32:35.3662113Z 2025-05-07T20:32:35.3662197Z if contiguous: 2025-05-07T20:32:35.3662390Z x0 = x0.contiguous() 2025-05-07T20:32:35.3662479Z x1 = x1.contiguous() 2025-05-07T20:32:35.3662553Z 2025-05-07T20:32:35.3662641Z if scale_ub is not None: 2025-05-07T20:32:35.3662745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3662876Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3662948Z ) 2025-05-07T20:32:35.3663021Z else: 2025-05-07T20:32:35.3663173Z scale_ub_tensor = None 2025-05-07T20:32:35.3663241Z 2025-05-07T20:32:35.3663369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3663456Z op = silu_mul_quant 2025-05-07T20:32:35.3663536Z if compiled: 2025-05-07T20:32:35.3663636Z op = torch.compile(op) 2025-05-07T20:32:35.3663737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3663809Z 2025-05-07T20:32:35.3663901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3663905Z 2025-05-07T20:32:35.3664002Z moe/activation_test.py:117: 2025-05-07T20:32:35.3664132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3664230Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3664326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3664688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3664781Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3665265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3665366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3665713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3665940Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3666275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3666363Z kernel = self.compile( 2025-05-07T20:32:35.3666736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3666905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3667029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3667040Z 2025-05-07T20:32:35.3667240Z self = 2025-05-07T20:32:35.3667993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3668495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c6b2e0>} 2025-05-07T20:32:35.3669225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3669415Z context = 2025-05-07T20:32:35.3669422Z 2025-05-07T20:32:35.3669582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3669836Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3669943Z module_map=module_map) 2025-05-07T20:32:35.3670101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3670247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3670319Z E ^ 2025-05-07T20:32:35.3670734Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3670739Z 2025-05-07T20:32:35.3671146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3671150Z 2025-05-07T20:32:35.3671249Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3671465Z self=, 2025-05-07T20:32:35.3671580Z T=4096, 2025-05-07T20:32:35.3671654Z D=5120, 2025-05-07T20:32:35.3671738Z scale_ub=1200.0, 2025-05-07T20:32:35.3671819Z contiguous=True, 2025-05-07T20:32:35.3671895Z compiled=True, 2025-05-07T20:32:35.3671969Z ) 2025-05-07T20:32:35.3672182Z self = 2025-05-07T20:32:35.3672348Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3672353Z 2025-05-07T20:32:35.3672432Z @given( 2025-05-07T20:32:35.3672552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3672647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3672762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3672877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3672988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3673058Z ) 2025-05-07T20:32:35.3673296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3673398Z def test_silu_mul_quant( 2025-05-07T20:32:35.3673469Z self, 2025-05-07T20:32:35.3673545Z T: int, 2025-05-07T20:32:35.3673622Z D: int, 2025-05-07T20:32:35.3673719Z scale_ub: Optional[float], 2025-05-07T20:32:35.3673805Z contiguous: bool, 2025-05-07T20:32:35.3673891Z compiled: bool, 2025-05-07T20:32:35.3673966Z ) -> None: 2025-05-07T20:32:35.3674056Z torch.manual_seed(2025) 2025-05-07T20:32:35.3674129Z 2025-05-07T20:32:35.3674297Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3674375Z 2025-05-07T20:32:35.3674463Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3674586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3674676Z x = x_sign * x_clamp 2025-05-07T20:32:35.3674752Z x0 = x[:, :D] 2025-05-07T20:32:35.3674827Z x1 = x[:, D:] 2025-05-07T20:32:35.3674904Z 2025-05-07T20:32:35.3674984Z if contiguous: 2025-05-07T20:32:35.3675069Z x0 = x0.contiguous() 2025-05-07T20:32:35.3675157Z x1 = x1.contiguous() 2025-05-07T20:32:35.3675229Z 2025-05-07T20:32:35.3675317Z if scale_ub is not None: 2025-05-07T20:32:35.3675427Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3675558Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3675633Z ) 2025-05-07T20:32:35.3675709Z else: 2025-05-07T20:32:35.3675805Z scale_ub_tensor = None 2025-05-07T20:32:35.3675879Z 2025-05-07T20:32:35.3676005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3676092Z op = silu_mul_quant 2025-05-07T20:32:35.3676179Z if compiled: 2025-05-07T20:32:35.3676278Z op = torch.compile(op) 2025-05-07T20:32:35.3676380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3676456Z 2025-05-07T20:32:35.3676543Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3676548Z 2025-05-07T20:32:35.3676640Z moe/activation_test.py:117: 2025-05-07T20:32:35.3676767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3676864Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3676963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3677372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3677576Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3678062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3678156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3678507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3678729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3679101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3679194Z kernel = self.compile( 2025-05-07T20:32:35.3679565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3679736Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3679868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3679872Z 2025-05-07T20:32:35.3680071Z self = 2025-05-07T20:32:35.3680829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3681323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fc860>} 2025-05-07T20:32:35.3682050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3682243Z context = 2025-05-07T20:32:35.3682247Z 2025-05-07T20:32:35.3682411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3682668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3682769Z module_map=module_map) 2025-05-07T20:32:35.3682926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3683024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3683100Z E ^ 2025-05-07T20:32:35.3683447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3683452Z 2025-05-07T20:32:35.3683858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3683862Z 2025-05-07T20:32:35.3683964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3684185Z self=, 2025-05-07T20:32:35.3684264Z T=128, 2025-05-07T20:32:35.3684339Z D=5120, 2025-05-07T20:32:35.3684423Z scale_ub=1200.0, 2025-05-07T20:32:35.3684508Z contiguous=False, 2025-05-07T20:32:35.3684593Z compiled=True, 2025-05-07T20:32:35.3684662Z ) 2025-05-07T20:32:35.3684873Z self = 2025-05-07T20:32:35.3685042Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3685050Z 2025-05-07T20:32:35.3685123Z @given( 2025-05-07T20:32:35.3685237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3685334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3685444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3685556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3685718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3685791Z ) 2025-05-07T20:32:35.3686131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3686225Z def test_silu_mul_quant( 2025-05-07T20:32:35.3686298Z self, 2025-05-07T20:32:35.3686376Z T: int, 2025-05-07T20:32:35.3686449Z D: int, 2025-05-07T20:32:35.3686546Z scale_ub: Optional[float], 2025-05-07T20:32:35.3686635Z contiguous: bool, 2025-05-07T20:32:35.3690385Z compiled: bool, 2025-05-07T20:32:35.3690472Z ) -> None: 2025-05-07T20:32:35.3690638Z torch.manual_seed(2025) 2025-05-07T20:32:35.3690707Z 2025-05-07T20:32:35.3690876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3690951Z 2025-05-07T20:32:35.3691038Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3691160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3691252Z x = x_sign * x_clamp 2025-05-07T20:32:35.3691331Z x0 = x[:, :D] 2025-05-07T20:32:35.3691411Z x1 = x[:, D:] 2025-05-07T20:32:35.3691482Z 2025-05-07T20:32:35.3691568Z if contiguous: 2025-05-07T20:32:35.3691662Z x0 = x0.contiguous() 2025-05-07T20:32:35.3691748Z x1 = x1.contiguous() 2025-05-07T20:32:35.3691815Z 2025-05-07T20:32:35.3691905Z if scale_ub is not None: 2025-05-07T20:32:35.3692008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3692138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3692217Z ) 2025-05-07T20:32:35.3692290Z else: 2025-05-07T20:32:35.3692381Z scale_ub_tensor = None 2025-05-07T20:32:35.3692453Z 2025-05-07T20:32:35.3692579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3692669Z op = silu_mul_quant 2025-05-07T20:32:35.3692752Z if compiled: 2025-05-07T20:32:35.3692850Z op = torch.compile(op) 2025-05-07T20:32:35.3692956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3693099Z 2025-05-07T20:32:35.3693190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3693195Z 2025-05-07T20:32:35.3693291Z moe/activation_test.py:117: 2025-05-07T20:32:35.3693417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3693515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3693614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3693974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3694071Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3694552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3694644Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3694995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3695219Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3695575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3695685Z kernel = self.compile( 2025-05-07T20:32:35.3696066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3696237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3696363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3696367Z 2025-05-07T20:32:35.3696567Z self = 2025-05-07T20:32:35.3697330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3697956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fd580>} 2025-05-07T20:32:35.3698692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3698877Z context = 2025-05-07T20:32:35.3698918Z 2025-05-07T20:32:35.3699083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3699338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3699442Z module_map=module_map) 2025-05-07T20:32:35.3699604Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3699699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3699774Z E ^ 2025-05-07T20:32:35.3700127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3700132Z 2025-05-07T20:32:35.3700536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3700541Z 2025-05-07T20:32:35.3700640Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3700858Z self=, 2025-05-07T20:32:35.3700934Z T=16384, 2025-05-07T20:32:35.3701013Z D=7168, 2025-05-07T20:32:35.3701094Z scale_ub=1200.0, 2025-05-07T20:32:35.3701175Z contiguous=True, 2025-05-07T20:32:35.3701256Z compiled=True, 2025-05-07T20:32:35.3701325Z ) 2025-05-07T20:32:35.3701536Z self = 2025-05-07T20:32:35.3701715Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3701725Z 2025-05-07T20:32:35.3701802Z @given( 2025-05-07T20:32:35.3701920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3702014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3702125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3702241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3702350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3702426Z ) 2025-05-07T20:32:35.3702666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3702756Z def test_silu_mul_quant( 2025-05-07T20:32:35.3702834Z self, 2025-05-07T20:32:35.3702908Z T: int, 2025-05-07T20:32:35.3702980Z D: int, 2025-05-07T20:32:35.3703077Z scale_ub: Optional[float], 2025-05-07T20:32:35.3703167Z contiguous: bool, 2025-05-07T20:32:35.3703250Z compiled: bool, 2025-05-07T20:32:35.3703327Z ) -> None: 2025-05-07T20:32:35.3703423Z torch.manual_seed(2025) 2025-05-07T20:32:35.3703493Z 2025-05-07T20:32:35.3703661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3703731Z 2025-05-07T20:32:35.3703820Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3703946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3704032Z x = x_sign * x_clamp 2025-05-07T20:32:35.3704112Z x0 = x[:, :D] 2025-05-07T20:32:35.3704192Z x1 = x[:, D:] 2025-05-07T20:32:35.3704261Z 2025-05-07T20:32:35.3704346Z if contiguous: 2025-05-07T20:32:35.3704440Z x0 = x0.contiguous() 2025-05-07T20:32:35.3704526Z x1 = x1.contiguous() 2025-05-07T20:32:35.3704600Z 2025-05-07T20:32:35.3704688Z if scale_ub is not None: 2025-05-07T20:32:35.3704789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3704972Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3705042Z ) 2025-05-07T20:32:35.3705185Z else: 2025-05-07T20:32:35.3705280Z scale_ub_tensor = None 2025-05-07T20:32:35.3705349Z 2025-05-07T20:32:35.3705474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3705566Z op = silu_mul_quant 2025-05-07T20:32:35.3705646Z if compiled: 2025-05-07T20:32:35.3705748Z op = torch.compile(op) 2025-05-07T20:32:35.3705889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3705959Z 2025-05-07T20:32:35.3706051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3706056Z 2025-05-07T20:32:35.3706148Z moe/activation_test.py:117: 2025-05-07T20:32:35.3706271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3706369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3706469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3706832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3706924Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3707406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3707503Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3707850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3708071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3708403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3708493Z kernel = self.compile( 2025-05-07T20:32:35.3708868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3709041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3709168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3709172Z 2025-05-07T20:32:35.3709374Z self = 2025-05-07T20:32:35.3710131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3710628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fe0c0>} 2025-05-07T20:32:35.3711354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3711545Z context = 2025-05-07T20:32:35.3711549Z 2025-05-07T20:32:35.3711714Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3711969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3712073Z module_map=module_map) 2025-05-07T20:32:35.3712230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3712330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3712407Z E ^ 2025-05-07T20:32:35.3712753Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3712758Z 2025-05-07T20:32:35.3713164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3713214Z 2025-05-07T20:32:35.3713312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3713599Z self=, 2025-05-07T20:32:35.3713677Z T=16384, 2025-05-07T20:32:35.3713751Z D=5120, 2025-05-07T20:32:35.3713831Z scale_ub=1200.0, 2025-05-07T20:32:35.3713916Z contiguous=True, 2025-05-07T20:32:35.3713996Z compiled=False, 2025-05-07T20:32:35.3714069Z ) 2025-05-07T20:32:35.3714283Z self = 2025-05-07T20:32:35.3714495Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3714499Z 2025-05-07T20:32:35.3714576Z @given( 2025-05-07T20:32:35.3714693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3714791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3714905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3715021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3715131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3715206Z ) 2025-05-07T20:32:35.3715444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3715533Z def test_silu_mul_quant( 2025-05-07T20:32:35.3715625Z self, 2025-05-07T20:32:35.3715705Z T: int, 2025-05-07T20:32:35.3715799Z D: int, 2025-05-07T20:32:35.3715907Z scale_ub: Optional[float], 2025-05-07T20:32:35.3715993Z contiguous: bool, 2025-05-07T20:32:35.3716082Z compiled: bool, 2025-05-07T20:32:35.3716155Z ) -> None: 2025-05-07T20:32:35.3716246Z torch.manual_seed(2025) 2025-05-07T20:32:35.3716320Z 2025-05-07T20:32:35.3716483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3716554Z 2025-05-07T20:32:35.3716644Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3716765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3716854Z x = x_sign * x_clamp 2025-05-07T20:32:35.3716933Z x0 = x[:, :D] 2025-05-07T20:32:35.3717016Z x1 = x[:, D:] 2025-05-07T20:32:35.3717086Z 2025-05-07T20:32:35.3717170Z if contiguous: 2025-05-07T20:32:35.3717257Z x0 = x0.contiguous() 2025-05-07T20:32:35.3717345Z x1 = x1.contiguous() 2025-05-07T20:32:35.3717417Z 2025-05-07T20:32:35.3717505Z if scale_ub is not None: 2025-05-07T20:32:35.3717610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3717745Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3717818Z ) 2025-05-07T20:32:35.3717892Z else: 2025-05-07T20:32:35.3717983Z scale_ub_tensor = None 2025-05-07T20:32:35.3718055Z 2025-05-07T20:32:35.3718185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3718273Z op = silu_mul_quant 2025-05-07T20:32:35.3718357Z if compiled: 2025-05-07T20:32:35.3718455Z op = torch.compile(op) 2025-05-07T20:32:35.3718560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3718633Z 2025-05-07T20:32:35.3718721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3718727Z 2025-05-07T20:32:35.3718819Z moe/activation_test.py:117: 2025-05-07T20:32:35.3718947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3719043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3719140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3719633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.3719726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3720080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3720296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3720771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3720865Z kernel = self.compile( 2025-05-07T20:32:35.3721238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3721407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3721534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3721577Z 2025-05-07T20:32:35.3721777Z self = 2025-05-07T20:32:35.3722538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3725537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172ff1a0>} 2025-05-07T20:32:35.3726291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3726485Z context = 2025-05-07T20:32:35.3726491Z 2025-05-07T20:32:35.3726651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3726917Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3727023Z module_map=module_map) 2025-05-07T20:32:35.3727182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3727287Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3727367Z E ^ 2025-05-07T20:32:35.3727723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3727750Z 2025-05-07T20:32:35.3728158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3728163Z 2025-05-07T20:32:35.3728268Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3728491Z self=, 2025-05-07T20:32:35.3728566Z T=1, 2025-05-07T20:32:35.3728643Z D=7168, 2025-05-07T20:32:35.3728729Z scale_ub=1200.0, 2025-05-07T20:32:35.3728812Z contiguous=False, 2025-05-07T20:32:35.3728892Z compiled=False, 2025-05-07T20:32:35.3728966Z ) 2025-05-07T20:32:35.3729177Z self = 2025-05-07T20:32:35.3729341Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3729349Z 2025-05-07T20:32:35.3729429Z @given( 2025-05-07T20:32:35.3729546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3729653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3729765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3729882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3729999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3730072Z ) 2025-05-07T20:32:35.3730312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3730410Z def test_silu_mul_quant( 2025-05-07T20:32:35.3730483Z self, 2025-05-07T20:32:35.3730557Z T: int, 2025-05-07T20:32:35.3730633Z D: int, 2025-05-07T20:32:35.3730729Z scale_ub: Optional[float], 2025-05-07T20:32:35.3730814Z contiguous: bool, 2025-05-07T20:32:35.3730898Z compiled: bool, 2025-05-07T20:32:35.3731032Z ) -> None: 2025-05-07T20:32:35.3731129Z torch.manual_seed(2025) 2025-05-07T20:32:35.3731202Z 2025-05-07T20:32:35.3731408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3731485Z 2025-05-07T20:32:35.3731574Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3731698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3731788Z x = x_sign * x_clamp 2025-05-07T20:32:35.3731866Z x0 = x[:, :D] 2025-05-07T20:32:35.3731945Z x1 = x[:, D:] 2025-05-07T20:32:35.3732020Z 2025-05-07T20:32:35.3732141Z if contiguous: 2025-05-07T20:32:35.3732230Z x0 = x0.contiguous() 2025-05-07T20:32:35.3732319Z x1 = x1.contiguous() 2025-05-07T20:32:35.3732390Z 2025-05-07T20:32:35.3732479Z if scale_ub is not None: 2025-05-07T20:32:35.3732581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3732710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3732791Z ) 2025-05-07T20:32:35.3732869Z else: 2025-05-07T20:32:35.3732959Z scale_ub_tensor = None 2025-05-07T20:32:35.3733097Z 2025-05-07T20:32:35.3733307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3733396Z op = silu_mul_quant 2025-05-07T20:32:35.3733479Z if compiled: 2025-05-07T20:32:35.3733576Z op = torch.compile(op) 2025-05-07T20:32:35.3733677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3733752Z 2025-05-07T20:32:35.3733840Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3733848Z 2025-05-07T20:32:35.3733948Z moe/activation_test.py:117: 2025-05-07T20:32:35.3734074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3734170Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3734268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3734758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3734855Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3735215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3735433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3735767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3735856Z kernel = self.compile( 2025-05-07T20:32:35.3736231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3736406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3736530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3736534Z 2025-05-07T20:32:35.3736740Z self = 2025-05-07T20:32:35.3737506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3737999Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016dbc680>} 2025-05-07T20:32:35.3738734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3738922Z context = 2025-05-07T20:32:35.3738926Z 2025-05-07T20:32:35.3739090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3739392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3739534Z module_map=module_map) 2025-05-07T20:32:35.3739703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3739797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3739870Z E ^ 2025-05-07T20:32:35.3740220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3740225Z 2025-05-07T20:32:35.3740628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3740671Z 2025-05-07T20:32:35.3740774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3740989Z self=, 2025-05-07T20:32:35.3741061Z T=4096, 2025-05-07T20:32:35.3741141Z D=7168, 2025-05-07T20:32:35.3741226Z scale_ub=1200.0, 2025-05-07T20:32:35.3741316Z contiguous=False, 2025-05-07T20:32:35.3741396Z compiled=True, 2025-05-07T20:32:35.3741466Z ) 2025-05-07T20:32:35.3741738Z self = 2025-05-07T20:32:35.3741910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3741915Z 2025-05-07T20:32:35.3741986Z @given( 2025-05-07T20:32:35.3742107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3742205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3742320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3742435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3742546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3742621Z ) 2025-05-07T20:32:35.3742859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3742950Z def test_silu_mul_quant( 2025-05-07T20:32:35.3743030Z self, 2025-05-07T20:32:35.3743103Z T: int, 2025-05-07T20:32:35.3743176Z D: int, 2025-05-07T20:32:35.3743279Z scale_ub: Optional[float], 2025-05-07T20:32:35.3743367Z contiguous: bool, 2025-05-07T20:32:35.3743449Z compiled: bool, 2025-05-07T20:32:35.3743526Z ) -> None: 2025-05-07T20:32:35.3743618Z torch.manual_seed(2025) 2025-05-07T20:32:35.3743688Z 2025-05-07T20:32:35.3743854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3743923Z 2025-05-07T20:32:35.3744017Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3744145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3744232Z x = x_sign * x_clamp 2025-05-07T20:32:35.3744313Z x0 = x[:, :D] 2025-05-07T20:32:35.3744390Z x1 = x[:, D:] 2025-05-07T20:32:35.3744461Z 2025-05-07T20:32:35.3744542Z if contiguous: 2025-05-07T20:32:35.3744630Z x0 = x0.contiguous() 2025-05-07T20:32:35.3744719Z x1 = x1.contiguous() 2025-05-07T20:32:35.3744795Z 2025-05-07T20:32:35.3744886Z if scale_ub is not None: 2025-05-07T20:32:35.3744990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3745123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3745194Z ) 2025-05-07T20:32:35.3745267Z else: 2025-05-07T20:32:35.3745365Z scale_ub_tensor = None 2025-05-07T20:32:35.3745454Z 2025-05-07T20:32:35.3745601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3745704Z op = silu_mul_quant 2025-05-07T20:32:35.3745788Z if compiled: 2025-05-07T20:32:35.3745888Z op = torch.compile(op) 2025-05-07T20:32:35.3745990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3746064Z 2025-05-07T20:32:35.3746158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3746163Z 2025-05-07T20:32:35.3746304Z moe/activation_test.py:117: 2025-05-07T20:32:35.3746431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3746574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3746672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3747033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3747123Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same Triton compile frames as above ...)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
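Note on the repeated CompilationError: fp8e4nv is Triton's name for the float8_e4m3fn format, which its NVIDIA backend only compiles for GPUs of compute capability 8.9 or newer; the GPU in this job evidently reports an older capability, so every example that reaches the kernel fails identically regardless of its parameters. A minimal guard is sketched below for illustration only; the skip helper and the class name are hypothetical, not code from activation_test.py:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs compute capability
        # >= 8.9 (Ada/Hopper); older GPUs raise the ValueError shown above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...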
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported in this architecture)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp, tried to allocate 112.00 MiB)
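The OOM report is explicit about where the memory sits: of the 22.07 GiB total, more than 21 GiB is already held by PyTorch when a request of a few hundred MiB fails, and the message's own suggestion is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A sketch of applying that setting, assuming it runs before the process touches CUDA (the variable is read once, when the caching allocator initializes):

    import os

    # Must be set before the first CUDA allocation in the process;
    # the caching allocator reads it once, at initialization.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # safe to import afterwards; the variable is consulted at first CUDA use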
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
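These parameter-independent repeats confirm the error is raised while Triton lowers the kernel's AST (src.make_ir), before any tensor data or launch grid matters. A stand-alone repro sketch under the same assumption (the kernel below is hypothetical, not the FBGEMM one): casting to tl.float8e4nv in any jitted kernel should trip the identical ValueError on an unsupported GPU.

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The .to(tl.float8e4nv) cast is what make_ir rejects on pre-8.9 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # CompilationError on SM < 8.9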
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 80.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 448.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3907308Z 2025-05-07T20:32:35.3907420Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3907424Z 2025-05-07T20:32:35.3907523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3907742Z self=, 2025-05-07T20:32:35.3907814Z T=4096, 2025-05-07T20:32:35.3907890Z D=7168, 2025-05-07T20:32:35.3907975Z scale_ub=None, 2025-05-07T20:32:35.3908054Z contiguous=True, 2025-05-07T20:32:35.3908137Z compiled=False, 2025-05-07T20:32:35.3908210Z ) 2025-05-07T20:32:35.3908419Z self = 2025-05-07T20:32:35.3908586Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3908591Z 2025-05-07T20:32:35.3908667Z @given( 2025-05-07T20:32:35.3908784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3908879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3908989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3909104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3909213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3909285Z ) 2025-05-07T20:32:35.3909526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3909618Z def test_silu_mul_quant( 2025-05-07T20:32:35.3909697Z self, 2025-05-07T20:32:35.3909774Z T: int, 2025-05-07T20:32:35.3909850Z D: int, 2025-05-07T20:32:35.3909948Z scale_ub: Optional[float], 2025-05-07T20:32:35.3910034Z contiguous: bool, 2025-05-07T20:32:35.3910116Z compiled: bool, 2025-05-07T20:32:35.3910198Z ) -> None: 2025-05-07T20:32:35.3910289Z torch.manual_seed(2025) 2025-05-07T20:32:35.3910364Z 2025-05-07T20:32:35.3910528Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3912305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3912353Z 2025-05-07T20:32:35.3912471Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3912475Z 2025-05-07T20:32:35.3912572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3912790Z self=, 2025-05-07T20:32:35.3912902Z T=16384, 2025-05-07T20:32:35.3912977Z D=7168, 2025-05-07T20:32:35.3913063Z scale_ub=None, 2025-05-07T20:32:35.3913145Z contiguous=True, 2025-05-07T20:32:35.3913229Z compiled=False, 2025-05-07T20:32:35.3913305Z ) 2025-05-07T20:32:35.3913514Z self = 2025-05-07T20:32:35.3913687Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3913691Z 2025-05-07T20:32:35.3913768Z @given( 2025-05-07T20:32:35.3913922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3914019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3914134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3914247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3914360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3914432Z ) 2025-05-07T20:32:35.3914676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3914774Z def test_silu_mul_quant( 2025-05-07T20:32:35.3914847Z self, 2025-05-07T20:32:35.3914922Z T: int, 2025-05-07T20:32:35.3914999Z D: int, 2025-05-07T20:32:35.3915095Z scale_ub: Optional[float], 2025-05-07T20:32:35.3915179Z contiguous: bool, 2025-05-07T20:32:35.3915268Z compiled: bool, 2025-05-07T20:32:35.3915345Z ) -> None: 2025-05-07T20:32:35.3915436Z torch.manual_seed(2025) 2025-05-07T20:32:35.3915513Z 2025-05-07T20:32:35.3915678Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3917423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
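The "Tried to allocate" sizes in these failures are exactly the test's first tensor: randn([T, 2 * D]) in bfloat16 costs T * 2D * 2 bytes. A quick check against the example sizes from the log above:

T, D = 16384, 7168
print(T * (2 * D) * 2 / 2**20)  # 448.0 -> "Tried to allocate 448.00 MiB"
# Likewise (4096, 7168) -> 112.0 MiB and (2048, 5120) -> 40.0 MiB, matching
# the earlier OutOfMemoryError messages in this run.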
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3917431Z 2025-05-07T20:32:35.3917545Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3917550Z 2025-05-07T20:32:35.3917657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3917871Z self=, 2025-05-07T20:32:35.3917948Z T=16384, 2025-05-07T20:32:35.3918025Z D=7168, 2025-05-07T20:32:35.3918104Z scale_ub=1200.0, 2025-05-07T20:32:35.3918185Z contiguous=True, 2025-05-07T20:32:35.3918265Z compiled=False, 2025-05-07T20:32:35.3918337Z ) 2025-05-07T20:32:35.3918546Z self = 2025-05-07T20:32:35.3918720Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3918726Z 2025-05-07T20:32:35.3918801Z @given( 2025-05-07T20:32:35.3918917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3919015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3919125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3919238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3919393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3919466Z ) 2025-05-07T20:32:35.3919750Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3919842Z def test_silu_mul_quant( 2025-05-07T20:32:35.3919918Z self, 2025-05-07T20:32:35.3919993Z T: int, 2025-05-07T20:32:35.3920066Z D: int, 2025-05-07T20:32:35.3920165Z scale_ub: Optional[float], 2025-05-07T20:32:35.3920253Z contiguous: bool, 2025-05-07T20:32:35.3920335Z compiled: bool, 2025-05-07T20:32:35.3920457Z ) -> None: 2025-05-07T20:32:35.3920546Z torch.manual_seed(2025) 2025-05-07T20:32:35.3920620Z 2025-05-07T20:32:35.3920786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3922565Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3922574Z 2025-05-07T20:32:35.3922690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3922695Z 2025-05-07T20:32:35.3922794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3923018Z self=, 2025-05-07T20:32:35.3923093Z T=128, 2025-05-07T20:32:35.3923166Z D=5120, 2025-05-07T20:32:35.3923249Z scale_ub=1200.0, 2025-05-07T20:32:35.3923335Z contiguous=False, 2025-05-07T20:32:35.3923418Z compiled=False, 2025-05-07T20:32:35.3923488Z ) 2025-05-07T20:32:35.3923697Z self = 2025-05-07T20:32:35.3923865Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3923874Z 2025-05-07T20:32:35.3923949Z @given( 2025-05-07T20:32:35.3924063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3924159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3924272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3924383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3924496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3924572Z ) 2025-05-07T20:32:35.3924810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3924904Z def test_silu_mul_quant( 2025-05-07T20:32:35.3924978Z self, 2025-05-07T20:32:35.3925052Z T: int, 2025-05-07T20:32:35.3925129Z D: int, 2025-05-07T20:32:35.3925225Z scale_ub: Optional[float], 2025-05-07T20:32:35.3925314Z contiguous: bool, 2025-05-07T20:32:35.3925398Z compiled: bool, 2025-05-07T20:32:35.3925473Z ) -> None: 2025-05-07T20:32:35.3925569Z torch.manual_seed(2025) 2025-05-07T20:32:35.3925644Z 2025-05-07T20:32:35.3925806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3925881Z 2025-05-07T20:32:35.3925971Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3926094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3926183Z x = x_sign * x_clamp 2025-05-07T20:32:35.3926264Z x0 = x[:, :D] 2025-05-07T20:32:35.3926340Z x1 = x[:, D:] 2025-05-07T20:32:35.3926409Z 2025-05-07T20:32:35.3926490Z if contiguous: 2025-05-07T20:32:35.3926578Z x0 = x0.contiguous() 2025-05-07T20:32:35.3926666Z x1 = x1.contiguous() 2025-05-07T20:32:35.3926737Z 2025-05-07T20:32:35.3926825Z if scale_ub is not None: 2025-05-07T20:32:35.3926978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3927109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3927223Z ) 2025-05-07T20:32:35.3927301Z else: 2025-05-07T20:32:35.3927395Z scale_ub_tensor = None 2025-05-07T20:32:35.3927474Z 2025-05-07T20:32:35.3927601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3927689Z op = silu_mul_quant 2025-05-07T20:32:35.3927773Z if compiled: 2025-05-07T20:32:35.3927870Z op = torch.compile(op) 2025-05-07T20:32:35.3928012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3928082Z 2025-05-07T20:32:35.3928169Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3928174Z 2025-05-07T20:32:35.3928267Z moe/activation_test.py:117: 2025-05-07T20:32:35.3928396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3928495Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3928597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3929131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3929227Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3929582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3929799Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3930136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3930230Z kernel = self.compile( 2025-05-07T20:32:35.3930605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3930781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3930908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3930912Z 2025-05-07T20:32:35.3931118Z self = 2025-05-07T20:32:35.3931882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3932377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016850cc0>} 2025-05-07T20:32:35.3933163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3933349Z context = 2025-05-07T20:32:35.3933356Z 2025-05-07T20:32:35.3933521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3933782Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3933886Z module_map=module_map) 2025-05-07T20:32:35.3934048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3934143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3934217Z E ^ 2025-05-07T20:32:35.3934569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3934577Z 2025-05-07T20:32:35.3934981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3934985Z 2025-05-07T20:32:35.3935097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3939478Z self=, 2025-05-07T20:32:35.3939643Z T=2048, 2025-05-07T20:32:35.3939721Z D=7168, 2025-05-07T20:32:35.3939849Z scale_ub=None, 2025-05-07T20:32:35.3939943Z contiguous=False, 2025-05-07T20:32:35.3940023Z compiled=False, 2025-05-07T20:32:35.3940093Z ) 2025-05-07T20:32:35.3940314Z self = 2025-05-07T20:32:35.3940488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3940493Z 2025-05-07T20:32:35.3940567Z @given( 2025-05-07T20:32:35.3940728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3940832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3940941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3941055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3941167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3941241Z ) 2025-05-07T20:32:35.3941490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3941583Z def test_silu_mul_quant( 2025-05-07T20:32:35.3941698Z self, 2025-05-07T20:32:35.3941781Z T: int, 2025-05-07T20:32:35.3941856Z D: int, 2025-05-07T20:32:35.3941952Z scale_ub: Optional[float], 2025-05-07T20:32:35.3942044Z contiguous: bool, 2025-05-07T20:32:35.3942129Z compiled: bool, 2025-05-07T20:32:35.3942205Z ) -> None: 2025-05-07T20:32:35.3942300Z torch.manual_seed(2025) 2025-05-07T20:32:35.3942372Z 2025-05-07T20:32:35.3942543Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3944313Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
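The CompilationError above is an architecture limit rather than a test bug: Triton's fp8e4nv is the e4m3 float8 format, which only lowers on compute capability 8.9+ (Ada/Hopper); the dtype list in the error is what older GPUs support. A hedged skip-guard sketch (the capability check is standard PyTorch; the skipUnless wiring is an assumption, not FBGEMM's actual gating):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # e4m3 ("fp8e4nv" in Triton) needs SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard for the test above:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")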
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3944321Z 2025-05-07T20:32:35.3944436Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3944440Z 2025-05-07T20:32:35.3944544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3944761Z self=, 2025-05-07T20:32:35.3944844Z T=128, 2025-05-07T20:32:35.3944915Z D=7168, 2025-05-07T20:32:35.3944996Z scale_ub=1200.0, 2025-05-07T20:32:35.3945077Z contiguous=True, 2025-05-07T20:32:35.3945157Z compiled=True, 2025-05-07T20:32:35.3945228Z ) 2025-05-07T20:32:35.3945447Z self = 2025-05-07T20:32:35.3945608Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3945616Z 2025-05-07T20:32:35.3945688Z @given( 2025-05-07T20:32:35.3945809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3945910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3946024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3946137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3946249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3946327Z ) 2025-05-07T20:32:35.3946566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3946660Z def test_silu_mul_quant( 2025-05-07T20:32:35.3946734Z self, 2025-05-07T20:32:35.3946806Z T: int, 2025-05-07T20:32:35.3946882Z D: int, 2025-05-07T20:32:35.3946981Z scale_ub: Optional[float], 2025-05-07T20:32:35.3947068Z contiguous: bool, 2025-05-07T20:32:35.3947150Z compiled: bool, 2025-05-07T20:32:35.3947276Z ) -> None: 2025-05-07T20:32:35.3947367Z torch.manual_seed(2025) 2025-05-07T20:32:35.3947444Z 2025-05-07T20:32:35.3947650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3947724Z 2025-05-07T20:32:35.3947815Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3947939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3948026Z x = x_sign * x_clamp 2025-05-07T20:32:35.3948112Z x0 = x[:, :D] 2025-05-07T20:32:35.3948190Z x1 = x[:, D:] 2025-05-07T20:32:35.3948259Z 2025-05-07T20:32:35.3948385Z if contiguous: 2025-05-07T20:32:35.3948476Z x0 = x0.contiguous() 2025-05-07T20:32:35.3948562Z x1 = x1.contiguous() 2025-05-07T20:32:35.3948637Z 2025-05-07T20:32:35.3948724Z if scale_ub is not None: 2025-05-07T20:32:35.3948826Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3948960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3949037Z ) 2025-05-07T20:32:35.3949110Z else: 2025-05-07T20:32:35.3949203Z scale_ub_tensor = None 2025-05-07T20:32:35.3949312Z 2025-05-07T20:32:35.3949451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3949537Z op = silu_mul_quant 2025-05-07T20:32:35.3949619Z if compiled: 2025-05-07T20:32:35.3949719Z op = torch.compile(op) 2025-05-07T20:32:35.3949823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3949894Z 2025-05-07T20:32:35.3949987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3949991Z 2025-05-07T20:32:35.3950085Z moe/activation_test.py:117: 2025-05-07T20:32:35.3950217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3950317Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3950413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3950785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3950880Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3951371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3951471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3951820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3952041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3952378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3952473Z kernel = self.compile( 2025-05-07T20:32:35.3952852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3953023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3953150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3953162Z 2025-05-07T20:32:35.3953365Z self = 2025-05-07T20:32:35.3954130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3954626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016851a80>} 2025-05-07T20:32:35.3955360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3955594Z context = 2025-05-07T20:32:35.3955599Z 2025-05-07T20:32:35.3955798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3956056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3956164Z module_map=module_map) 2025-05-07T20:32:35.3956324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3956421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3956498Z E ^ 2025-05-07T20:32:35.3956886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3956892Z 2025-05-07T20:32:35.3957296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3957301Z 2025-05-07T20:32:35.3957399Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3957618Z self=, 2025-05-07T20:32:35.3957695Z T=128, 2025-05-07T20:32:35.3957810Z D=7168, 2025-05-07T20:32:35.3957895Z scale_ub=1200.0, 2025-05-07T20:32:35.3957977Z contiguous=True, 2025-05-07T20:32:35.3958061Z compiled=False, 2025-05-07T20:32:35.3958135Z ) 2025-05-07T20:32:35.3958346Z self = 2025-05-07T20:32:35.3958510Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3958518Z 2025-05-07T20:32:35.3958594Z @given( 2025-05-07T20:32:35.3958712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3958810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3958923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3959036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3959151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3959452Z ) 2025-05-07T20:32:35.3959767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3959863Z def test_silu_mul_quant( 2025-05-07T20:32:35.3959938Z self, 2025-05-07T20:32:35.3960013Z T: int, 2025-05-07T20:32:35.3960092Z D: int, 2025-05-07T20:32:35.3960188Z scale_ub: Optional[float], 2025-05-07T20:32:35.3960275Z contiguous: bool, 2025-05-07T20:32:35.3960362Z compiled: bool, 2025-05-07T20:32:35.3960440Z ) -> None: 2025-05-07T20:32:35.3960538Z torch.manual_seed(2025) 2025-05-07T20:32:35.3960615Z 2025-05-07T20:32:35.3960780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3960855Z 2025-05-07T20:32:35.3960945Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3961067Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3962827Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
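Note the failure point drifting from the first allocation (activation_test.py:92) to the later clamp (:95) while the "allocated by PyTorch" figure creeps upward: memory pressure is accumulating across Hypothesis examples rather than coming from any single one. A blunt mitigation sketch, assuming it is called at the top of the test body (Hypothesis invokes the function once per generated example); the helper name is made up:

import gc
import torch

def _reclaim_cuda_memory() -> None:
    gc.collect()              # drop tensors that are no longer reachable
    torch.cuda.empty_cache()  # return freed cached blocks to the driver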
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3962835Z 2025-05-07T20:32:35.3962948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.3962957Z 2025-05-07T20:32:35.3963061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3963276Z self=, 2025-05-07T20:32:35.3963352Z T=128, 2025-05-07T20:32:35.3963432Z D=5120, 2025-05-07T20:32:35.3963516Z scale_ub=1200.0, 2025-05-07T20:32:35.3963600Z contiguous=True, 2025-05-07T20:32:35.3963774Z compiled=True, 2025-05-07T20:32:35.3963848Z ) 2025-05-07T20:32:35.3964125Z self = 2025-05-07T20:32:35.3964293Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3964297Z 2025-05-07T20:32:35.3964369Z @given( 2025-05-07T20:32:35.3964486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3964584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3964692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3964894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3965003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3965078Z ) 2025-05-07T20:32:35.3965320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3965409Z def test_silu_mul_quant( 2025-05-07T20:32:35.3965488Z self, 2025-05-07T20:32:35.3965562Z T: int, 2025-05-07T20:32:35.3965634Z D: int, 2025-05-07T20:32:35.3965730Z scale_ub: Optional[float], 2025-05-07T20:32:35.3965819Z contiguous: bool, 2025-05-07T20:32:35.3965969Z compiled: bool, 2025-05-07T20:32:35.3966051Z ) -> None: 2025-05-07T20:32:35.3966143Z torch.manual_seed(2025) 2025-05-07T20:32:35.3966213Z 2025-05-07T20:32:35.3966378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3966449Z 2025-05-07T20:32:35.3966539Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3966663Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3968408Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
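For orientation, the op under test fuses SiLU gating with row-wise float8 quantization; the test body shown above compares FBGEMM's Triton kernel against an fp32 reference. A pure-PyTorch stand-in for the whole pipeline (assumptions: float8_e4m3fn as the target dtype and a max-over-fp8-max scale convention; the function name is illustrative, not FBGEMM's API):

import torch

def silu_mul_quant_sketch(x0, x1, scale_ub=None):
    # SiLU(x0) * x1 in fp32, then symmetric per-row quantization to fp8.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
    scale = row_max / torch.finfo(torch.float8_e4m3fn).max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize as y_fp8.float() * scale[:, None]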
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3968416Z 2025-05-07T20:32:35.3968533Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.3968538Z 2025-05-07T20:32:35.3968636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3968854Z self=, 2025-05-07T20:32:35.3968928Z T=128, 2025-05-07T20:32:35.3969004Z D=7168, 2025-05-07T20:32:35.3969087Z scale_ub=None, 2025-05-07T20:32:35.3969168Z contiguous=True, 2025-05-07T20:32:35.3969250Z compiled=True, 2025-05-07T20:32:35.3969326Z ) 2025-05-07T20:32:35.3969537Z self = 2025-05-07T20:32:35.3969700Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.3969711Z 2025-05-07T20:32:35.3969784Z @given( 2025-05-07T20:32:35.3969898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3970001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3970111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3970224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3970337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3970410Z ) 2025-05-07T20:32:35.3970651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3970748Z def test_silu_mul_quant( 2025-05-07T20:32:35.3970820Z self, 2025-05-07T20:32:35.3970892Z T: int, 2025-05-07T20:32:35.3970967Z D: int, 2025-05-07T20:32:35.3971063Z scale_ub: Optional[float], 2025-05-07T20:32:35.3971153Z contiguous: bool, 2025-05-07T20:32:35.3971237Z compiled: bool, 2025-05-07T20:32:35.3971313Z ) -> None: 2025-05-07T20:32:35.3971453Z torch.manual_seed(2025) 2025-05-07T20:32:35.3971526Z 2025-05-07T20:32:35.3971725Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3973527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3973573Z 2025-05-07T20:32:35.3973724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3973859Z =============================== warnings summary =============================== 2025-05-07T20:32:35.3974163Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3974499Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3974793Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3975651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.3975887Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.3975892Z 2025-05-07T20:32:35.3976103Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.3976273Z ================= 1 failed, 1 deselected, 3 warnings in 15.08s ================= 2025-05-07T20:32:36.9684307Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.0302618Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:37.0302843Z 2025-05-07T20:32:39.0321060Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.1868365Z ============================= test session starts ============================== 2025-05-07T20:32:41.1869167Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.1869745Z cachedir: .pytest_cache 2025-05-07T20:32:41.1870313Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.1871044Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.1871449Z plugins: hypothesis-6.131.14 2025-05-07T20:32:42.8017523Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:42.9121666Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:42.9122073Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:42.9122288Z 2025-05-07T20:32:45.2680248Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.2680970Z self=, 2025-05-07T20:32:45.2681385Z T=1, 2025-05-07T20:32:45.2681573Z D=5120, 2025-05-07T20:32:45.2681778Z scale_ub=None, 2025-05-07T20:32:45.2681995Z contiguous=True, 2025-05-07T20:32:45.2682218Z compiled=True, 2025-05-07T20:32:45.2682435Z ) 2025-05-07T20:32:45.2682758Z self = 2025-05-07T20:32:45.2683633Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.2683992Z 2025-05-07T20:32:45.2684079Z @given( 2025-05-07T20:32:45.2684316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.2684633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.2684936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.2685266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.2685595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.2685976Z ) 2025-05-07T20:32:45.2686326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.2686786Z def test_silu_mul_quant( 2025-05-07T20:32:45.2687029Z self, 2025-05-07T20:32:45.2687234Z T: int, 2025-05-07T20:32:45.2687440Z D: int, 2025-05-07T20:32:45.2687656Z scale_ub: Optional[float], 2025-05-07T20:32:45.2687930Z contiguous: bool, 2025-05-07T20:32:45.2688173Z compiled: bool, 2025-05-07T20:32:45.2688397Z ) -> None: 2025-05-07T20:32:45.2688706Z torch.manual_seed(2025) 2025-05-07T20:32:45.2688958Z 2025-05-07T20:32:45.2689231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.2689575Z 2025-05-07T20:32:45.2689779Z x_sign = torch.sign(x) 2025-05-07T20:32:45.2690067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.2690379Z x = x_sign * x_clamp 2025-05-07T20:32:45.2690628Z x0 = x[:, :D] 2025-05-07T20:32:45.2690841Z x1 = x[:, D:] 2025-05-07T20:32:45.2691059Z 2025-05-07T20:32:45.2691250Z if contiguous: 2025-05-07T20:32:45.2691477Z x0 = x0.contiguous() 2025-05-07T20:32:45.2691735Z x1 = x1.contiguous() 2025-05-07T20:32:45.2691979Z 2025-05-07T20:32:45.2692173Z if scale_ub is not None: 2025-05-07T20:32:45.2692445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.2692783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.2693249Z ) 2025-05-07T20:32:45.2693440Z else: 2025-05-07T20:32:45.2693656Z scale_ub_tensor = None 2025-05-07T20:32:45.2693909Z 2025-05-07T20:32:45.2694139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.2694454Z op = silu_mul_quant 2025-05-07T20:32:45.2694704Z if compiled: 2025-05-07T20:32:45.2694949Z op = torch.compile(op) 2025-05-07T20:32:45.2695254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.2695530Z 2025-05-07T20:32:45.2695715Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.2696006Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.2696296Z 2025-05-07T20:32:45.2696532Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.2696870Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.2697166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.2697484Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.2697848Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.2698163Z 2025-05-07T20:32:45.2698368Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.2698561Z 2025-05-07T20:32:45.2698663Z moe/activation_test.py:126: 2025-05-07T20:32:45.2698964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.2699301Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.2699631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.2700417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.2701167Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.2701714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.2702488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.2703176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.2703894Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.2704618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.2705297Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.2705898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.2706413Z fn() 2025-05-07T20:32:45.2706923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.2707499Z self.fn.run( 2025-05-07T20:32:45.2707971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.2708551Z kernel = self.compile( 2025-05-07T20:32:45.2709086Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.2709735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.2710186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.2710423Z 2025-05-07T20:32:45.2710636Z self = 2025-05-07T20:32:45.2711709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.2713103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc89114dc60>} 2025-05-07T20:32:45.2714432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.2715448Z context = 2025-05-07T20:32:45.2715738Z 2025-05-07T20:32:45.2715905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.2716429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.2716898Z module_map=module_map) 2025-05-07T20:32:45.2717269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.2717622Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.2717897Z E ^ 2025-05-07T20:32:45.2718366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.2718813Z 2025-05-07T20:32:45.2719235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.2719743Z 2025-05-07T20:32:45.2719848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.2720293Z self=, 2025-05-07T20:32:45.2720722Z T=2048, 2025-05-07T20:32:45.2720918Z D=5120, 2025-05-07T20:32:45.2721117Z scale_ub=1200.0, 2025-05-07T20:32:45.2721344Z contiguous=True, 2025-05-07T20:32:45.2721564Z compiled=False, 2025-05-07T20:32:45.2721777Z ) 2025-05-07T20:32:46.0045162Z self = 2025-05-07T20:32:46.0045932Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.0046625Z 2025-05-07T20:32:46.0046734Z @given( 2025-05-07T20:32:46.0047052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.0047627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.0047962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.0048293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.0048625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.0048910Z ) 2025-05-07T20:32:46.0049270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.0049820Z def test_silu_mul_quant( 2025-05-07T20:32:46.0050063Z self, 2025-05-07T20:32:46.0050254Z T: int, 2025-05-07T20:32:46.0050456Z D: int, 2025-05-07T20:32:46.0050678Z scale_ub: Optional[float], 2025-05-07T20:32:46.0050950Z contiguous: bool, 2025-05-07T20:32:46.0051197Z compiled: bool, 2025-05-07T20:32:46.0051437Z ) -> None: 2025-05-07T20:32:46.0051654Z torch.manual_seed(2025) 2025-05-07T20:32:46.0051902Z 2025-05-07T20:32:46.0052288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.0052634Z 2025-05-07T20:32:46.0052837Z x_sign = torch.sign(x) 2025-05-07T20:32:46.0053257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.0053566Z x = x_sign * x_clamp 2025-05-07T20:32:46.0053819Z x0 = x[:, :D] 
2025-05-07T20:32:46.0054040Z x1 = x[:, D:] 2025-05-07T20:32:46.0054241Z 2025-05-07T20:32:46.0054438Z if contiguous: 2025-05-07T20:32:46.0054674Z x0 = x0.contiguous() 2025-05-07T20:32:46.0054928Z x1 = x1.contiguous() 2025-05-07T20:32:46.0055172Z 2025-05-07T20:32:46.0055379Z if scale_ub is not None: 2025-05-07T20:32:46.0055656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.0055992Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.0056306Z ) 2025-05-07T20:32:46.0056508Z else: 2025-05-07T20:32:46.0056716Z scale_ub_tensor = None 2025-05-07T20:32:46.0056972Z 2025-05-07T20:32:46.0057209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0057522Z op = silu_mul_quant 2025-05-07T20:32:46.0057778Z if compiled: 2025-05-07T20:32:46.0058028Z op = torch.compile(op) 2025-05-07T20:32:46.0058318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0058596Z 2025-05-07T20:32:46.0058792Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.0058960Z 2025-05-07T20:32:46.0059060Z moe/activation_test.py:117: 2025-05-07T20:32:46.0059791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0060135Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.0060419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0061110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.0061801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.0062346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.0063022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.0063684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.0064215Z kernel = self.compile( 2025-05-07T20:32:46.0064764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.0065405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.0065807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0066033Z 2025-05-07T20:32:46.0066245Z self = 2025-05-07T20:32:46.0067474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.0068839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db0220>} 2025-05-07T20:32:46.0070169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.0071251Z context = 2025-05-07T20:32:46.0071535Z 2025-05-07T20:32:46.0071708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.0072222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.0072696Z module_map=module_map) 2025-05-07T20:32:46.0073129Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.0073486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.0073745Z E ^ 2025-05-07T20:32:46.0074211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.0074655Z 2025-05-07T20:32:46.0075073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.0075586Z 2025-05-07T20:32:46.0075697Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.0076104Z self=, 2025-05-07T20:32:46.0076509Z T=2048, 2025-05-07T20:32:46.0076705Z D=5120, 2025-05-07T20:32:46.0076899Z scale_ub=1200.0, 2025-05-07T20:32:46.0077127Z contiguous=True, 2025-05-07T20:32:46.0077357Z compiled=True, 2025-05-07T20:32:46.0077562Z ) 2025-05-07T20:32:46.0077889Z self = 2025-05-07T20:32:46.0078386Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.0078651Z 2025-05-07T20:32:46.0078732Z @given( 2025-05-07T20:32:46.0078971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.0079285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.0079605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.0079979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.0080311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.0080603Z ) 2025-05-07T20:32:46.0080949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.0081388Z def test_silu_mul_quant( 2025-05-07T20:32:46.0081630Z self, 2025-05-07T20:32:46.0081824Z T: int, 2025-05-07T20:32:46.0082027Z D: int, 2025-05-07T20:32:46.0082252Z scale_ub: Optional[float], 2025-05-07T20:32:46.0082521Z contiguous: bool, 2025-05-07T20:32:46.0082765Z compiled: bool, 2025-05-07T20:32:46.0083007Z ) -> None: 2025-05-07T20:32:46.0090439Z torch.manual_seed(2025) 2025-05-07T20:32:46.0090714Z 2025-05-07T20:32:46.0091103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.0091571Z 2025-05-07T20:32:46.0091852Z x_sign = torch.sign(x) 2025-05-07T20:32:46.0092189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.0092498Z x = x_sign * x_clamp 2025-05-07T20:32:46.0092743Z x0 = x[:, :D] 2025-05-07T20:32:46.0092964Z x1 = x[:, D:] 2025-05-07T20:32:46.0093228Z 2025-05-07T20:32:46.0093446Z if contiguous: 2025-05-07T20:32:46.0093682Z x0 = x0.contiguous() 2025-05-07T20:32:46.0094036Z x1 = x1.contiguous() 2025-05-07T20:32:46.0094267Z 2025-05-07T20:32:46.0094465Z if scale_ub is not None: 2025-05-07T20:32:46.0094795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.0095137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.0095445Z ) 2025-05-07T20:32:46.0095653Z else: 2025-05-07T20:32:46.0095871Z scale_ub_tensor = None 2025-05-07T20:32:46.0096118Z 2025-05-07T20:32:46.0096357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0096724Z op = silu_mul_quant 2025-05-07T20:32:46.0096976Z if compiled: 2025-05-07T20:32:46.0097235Z op = torch.compile(op) 2025-05-07T20:32:46.0097536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0097811Z 2025-05-07T20:32:46.0098012Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.0098312Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.0098603Z 2025-05-07T20:32:46.0098852Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0099241Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.0099542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.0099860Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.0100229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.0100546Z 2025-05-07T20:32:46.0100754Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.0100953Z 2025-05-07T20:32:46.0101058Z moe/activation_test.py:126: 2025-05-07T20:32:46.0101363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0101692Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.0102021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.0102809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.0103558Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.0104102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.0104780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.0105463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.0106180Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.0106895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.0107533Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.0108131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.0108636Z fn() 2025-05-07T20:32:46.0109144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.0109722Z self.fn.run( 2025-05-07T20:32:46.0110193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.0110709Z kernel = self.compile( 2025-05-07T20:32:46.0111247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.0111899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.0112287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0112521Z 2025-05-07T20:32:46.0112730Z self = 2025-05-07T20:32:46.0113848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.0115246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db16c0>} 2025-05-07T20:32:46.0116568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.0117615Z context = 2025-05-07T20:32:46.0117907Z 2025-05-07T20:32:46.0118071Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.0118589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.0119054Z module_map=module_map) 2025-05-07T20:32:46.0119419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.0119791Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.0120142Z E ^ 2025-05-07T20:32:46.0120601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.0121055Z 2025-05-07T20:32:46.0121467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.0121981Z 2025-05-07T20:32:46.0122090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.0122514Z self=, 2025-05-07T20:32:46.0122910Z T=16384, 2025-05-07T20:32:46.0123112Z D=7168, 2025-05-07T20:32:46.0123315Z scale_ub=1200.0, 2025-05-07T20:32:46.0123540Z contiguous=False, 2025-05-07T20:32:46.0123777Z compiled=False, 2025-05-07T20:32:46.0123990Z ) 2025-05-07T20:32:46.7388674Z self = 2025-05-07T20:32:46.7389428Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.7389719Z 2025-05-07T20:32:46.7389803Z @given( 2025-05-07T20:32:46.7390054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7390372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7390693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7391031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7391366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7391658Z ) 2025-05-07T20:32:46.7392019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7392463Z def test_silu_mul_quant( 2025-05-07T20:32:46.7392714Z self, 2025-05-07T20:32:46.7392926Z T: int, 2025-05-07T20:32:46.7393130Z D: int, 2025-05-07T20:32:46.7393358Z scale_ub: Optional[float], 2025-05-07T20:32:46.7393644Z contiguous: bool, 2025-05-07T20:32:46.7393892Z compiled: bool, 2025-05-07T20:32:46.7394127Z ) -> None: 2025-05-07T20:32:46.7394350Z torch.manual_seed(2025) 2025-05-07T20:32:46.7394594Z 2025-05-07T20:32:46.7394864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7395214Z 2025-05-07T20:32:46.7395412Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7395766Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7396183Z x = x_sign * x_clamp 2025-05-07T20:32:46.7396446Z x0 = x[:, :D] 2025-05-07T20:32:46.7396667Z x1 = x[:, D:] 2025-05-07T20:32:46.7396893Z 2025-05-07T20:32:46.7397086Z if contiguous: 2025-05-07T20:32:46.7397317Z x0 = x0.contiguous() 2025-05-07T20:32:46.7397580Z x1 = x1.contiguous() 2025-05-07T20:32:46.7397838Z 2025-05-07T20:32:46.7398031Z if scale_ub is not None: 2025-05-07T20:32:46.7398609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7399045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7399371Z ) 2025-05-07T20:32:46.7399566Z else: 2025-05-07T20:32:46.7399789Z scale_ub_tensor = None 2025-05-07T20:32:46.7400053Z 2025-05-07T20:32:46.7400287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7400611Z op = silu_mul_quant 2025-05-07T20:32:46.7400866Z if compiled: 2025-05-07T20:32:46.7401114Z op = torch.compile(op) 2025-05-07T20:32:46.7401510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7401793Z 2025-05-07T20:32:46.7401983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.7402159Z 2025-05-07T20:32:46.7402262Z moe/activation_test.py:117: 2025-05-07T20:32:46.7402565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7402904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.7403182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7403958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:46.7404650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.7405181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7405867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7406534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7407066Z kernel = self.compile( 2025-05-07T20:32:46.7407600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7408250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7408646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7408872Z 2025-05-07T20:32:46.7409087Z self = 2025-05-07T20:32:46.7410163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7411543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28540>} 2025-05-07T20:32:46.7412879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7413984Z context = 2025-05-07T20:32:46.7414273Z 2025-05-07T20:32:46.7414441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7414959Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7415427Z module_map=module_map) 2025-05-07T20:32:46.7415807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7416164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.7416419Z E ^ 2025-05-07T20:32:46.7416882Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7417333Z 2025-05-07T20:32:46.7417747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7418253Z 2025-05-07T20:32:46.7418368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7418829Z self=, 2025-05-07T20:32:46.7419231Z T=1, 2025-05-07T20:32:46.7419419Z D=7168, 2025-05-07T20:32:46.7419656Z scale_ub=None, 2025-05-07T20:32:46.7419881Z contiguous=True, 2025-05-07T20:32:46.7420143Z compiled=True, 2025-05-07T20:32:46.7420360Z ) 2025-05-07T20:32:46.7420679Z self = 2025-05-07T20:32:46.7421159Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7421410Z 2025-05-07T20:32:46.7421491Z @given( 2025-05-07T20:32:46.7421761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7422073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7422379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7422697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7423026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7423315Z ) 2025-05-07T20:32:46.7423658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7424099Z def test_silu_mul_quant( 2025-05-07T20:32:46.7424390Z self, 2025-05-07T20:32:46.7424583Z T: int, 2025-05-07T20:32:46.7424786Z D: int, 2025-05-07T20:32:46.7425009Z scale_ub: Optional[float], 2025-05-07T20:32:46.7425278Z contiguous: bool, 2025-05-07T20:32:46.7425518Z compiled: bool, 2025-05-07T20:32:46.7425744Z ) -> None: 2025-05-07T20:32:46.7425972Z torch.manual_seed(2025) 2025-05-07T20:32:46.7426221Z 2025-05-07T20:32:46.7426496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7426831Z 2025-05-07T20:32:46.7427033Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7427333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7427650Z x = x_sign * x_clamp 2025-05-07T20:32:46.7427895Z x0 = x[:, :D] 2025-05-07T20:32:46.7428126Z x1 = x[:, D:] 2025-05-07T20:32:46.7428337Z 2025-05-07T20:32:46.7428525Z if contiguous: 2025-05-07T20:32:46.7428770Z x0 = x0.contiguous() 2025-05-07T20:32:46.7429042Z x1 = x1.contiguous() 2025-05-07T20:32:46.7429282Z 2025-05-07T20:32:46.7429488Z if scale_ub is not None: 2025-05-07T20:32:46.7429771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7430106Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7430423Z ) 2025-05-07T20:32:46.7430622Z else: 2025-05-07T20:32:46.7430838Z scale_ub_tensor = None 2025-05-07T20:32:46.7431093Z 2025-05-07T20:32:46.7431330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7431643Z op = silu_mul_quant 2025-05-07T20:32:46.7431900Z if compiled: 2025-05-07T20:32:46.7432153Z op = torch.compile(op) 2025-05-07T20:32:46.7432461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7432740Z 2025-05-07T20:32:46.7432935Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7433229Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7433519Z 2025-05-07T20:32:46.7433763Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7434103Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7434395Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7434713Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7435080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7435391Z 2025-05-07T20:32:46.7435598Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:46.7435801Z 2025-05-07T20:32:46.7435901Z moe/activation_test.py:126: 2025-05-07T20:32:46.7436202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7436538Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7436917Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7437751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7438498Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7439049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7439726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7440451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7441159Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7441886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7442523Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7443159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7443671Z fn() 2025-05-07T20:32:46.7444180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7444759Z self.fn.run( 2025-05-07T20:32:46.7445225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7445753Z kernel = self.compile( 2025-05-07T20:32:46.7446294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7446941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7447328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7447560Z 2025-05-07T20:32:46.7447767Z self = 2025-05-07T20:32:46.7448846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7450265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28e00>} 2025-05-07T20:32:46.7451588Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7452607Z context = 2025-05-07T20:32:46.7452900Z 2025-05-07T20:32:46.7453149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7453674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7454145Z module_map=module_map) 2025-05-07T20:32:46.7454521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7454881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7455148Z E ^ 2025-05-07T20:32:46.7455613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7456064Z 2025-05-07T20:32:46.7456482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7456987Z 2025-05-07T20:32:46.7457100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7457515Z self=, 2025-05-07T20:32:46.7457921Z T=4096, 2025-05-07T20:32:46.7458119Z D=5120, 2025-05-07T20:32:46.7458398Z scale_ub=None, 2025-05-07T20:32:46.7458619Z contiguous=False, 2025-05-07T20:32:46.7458851Z compiled=False, 2025-05-07T20:32:46.7459108Z ) 2025-05-07T20:32:47.5320028Z self = 2025-05-07T20:32:47.5320992Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.5321347Z 2025-05-07T20:32:47.5321437Z @given( 2025-05-07T20:32:47.5321689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5322059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5322701Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5323036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5323363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5323655Z ) 2025-05-07T20:32:47.5324015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5324466Z def test_silu_mul_quant( 2025-05-07T20:32:47.5324724Z self, 2025-05-07T20:32:47.5324925Z T: int, 2025-05-07T20:32:47.5325119Z D: int, 2025-05-07T20:32:47.5325438Z scale_ub: Optional[float], 2025-05-07T20:32:47.5325714Z contiguous: bool, 2025-05-07T20:32:47.5325963Z compiled: bool, 2025-05-07T20:32:47.5326191Z ) -> None: 2025-05-07T20:32:47.5326412Z torch.manual_seed(2025) 2025-05-07T20:32:47.5326665Z 2025-05-07T20:32:47.5326946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5327308Z 2025-05-07T20:32:47.5327508Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5327797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5328110Z x = x_sign * x_clamp 2025-05-07T20:32:47.5328357Z x0 = x[:, :D] 2025-05-07T20:32:47.5328571Z x1 = x[:, D:] 2025-05-07T20:32:47.5328787Z 2025-05-07T20:32:47.5328978Z if contiguous: 2025-05-07T20:32:47.5329209Z x0 = x0.contiguous() 2025-05-07T20:32:47.5329473Z x1 = x1.contiguous() 2025-05-07T20:32:47.5329725Z 2025-05-07T20:32:47.5329918Z if scale_ub is not None: 2025-05-07T20:32:47.5330191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5330533Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5330842Z ) 2025-05-07T20:32:47.5331032Z else: 2025-05-07T20:32:47.5331251Z scale_ub_tensor = None 2025-05-07T20:32:47.5331511Z 2025-05-07T20:32:47.5331743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5332066Z op = silu_mul_quant 2025-05-07T20:32:47.5332318Z if compiled: 2025-05-07T20:32:47.5332563Z op = torch.compile(op) 2025-05-07T20:32:47.5332863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5333260Z 2025-05-07T20:32:47.5333452Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.5333628Z 2025-05-07T20:32:47.5333729Z moe/activation_test.py:117: 2025-05-07T20:32:47.5334027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5334359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.5334647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5335341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.5336030Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.5336562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5337248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5337909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5338440Z kernel = self.compile( 2025-05-07T20:32:47.5338978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5339823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5340230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5340459Z 2025-05-07T20:32:47.5340667Z self = 2025-05-07T20:32:47.5341743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5343159Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890f7f240>} 2025-05-07T20:32:47.5344484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5345554Z context = 2025-05-07T20:32:47.5345841Z 2025-05-07T20:32:47.5346011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5346534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5347004Z module_map=module_map) 2025-05-07T20:32:47.5347366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5347722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.5347986Z E ^ 2025-05-07T20:32:47.5348447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5348890Z 2025-05-07T20:32:47.5349304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5349813Z 2025-05-07T20:32:47.5349917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5350381Z self=, 2025-05-07T20:32:47.5350789Z T=4096, 2025-05-07T20:32:47.5350976Z D=7168, 2025-05-07T20:32:47.5351171Z scale_ub=None, 2025-05-07T20:32:47.5351397Z contiguous=False, 2025-05-07T20:32:47.5351622Z compiled=False, 2025-05-07T20:32:47.5351830Z ) 2025-05-07T20:32:47.5352156Z self = 2025-05-07T20:32:47.5352647Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.5352929Z 2025-05-07T20:32:47.5353010Z @given( 2025-05-07T20:32:47.5353245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5353552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5353863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5354195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5354526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5354813Z ) 2025-05-07T20:32:47.5355168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5355612Z def test_silu_mul_quant( 2025-05-07T20:32:47.5355850Z self, 2025-05-07T20:32:47.5356050Z T: int, 2025-05-07T20:32:47.5356252Z D: int, 2025-05-07T20:32:47.5356470Z scale_ub: Optional[float], 2025-05-07T20:32:47.5356751Z contiguous: bool, 2025-05-07T20:32:47.5356994Z compiled: bool, 2025-05-07T20:32:47.5357216Z ) -> None: 2025-05-07T20:32:47.5357436Z torch.manual_seed(2025) 2025-05-07T20:32:47.5357685Z 2025-05-07T20:32:47.5357957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5358303Z 2025-05-07T20:32:47.5358505Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5358867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5359668Z x = x_sign * x_clamp 2025-05-07T20:32:47.5360001Z x0 = x[:, :D] 2025-05-07T20:32:47.5360234Z x1 = x[:, D:] 2025-05-07T20:32:47.5360454Z 2025-05-07T20:32:47.5367707Z if contiguous: 2025-05-07T20:32:47.5367955Z x0 = x0.contiguous() 2025-05-07T20:32:47.5368212Z x1 = x1.contiguous() 2025-05-07T20:32:47.5368462Z 2025-05-07T20:32:47.5368663Z if scale_ub is not None: 2025-05-07T20:32:47.5368943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5369406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5369722Z ) 2025-05-07T20:32:47.5369926Z else: 2025-05-07T20:32:47.5370139Z scale_ub_tensor = None 2025-05-07T20:32:47.5370402Z 2025-05-07T20:32:47.5370645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5370962Z op = silu_mul_quant 2025-05-07T20:32:47.5371225Z if compiled: 2025-05-07T20:32:47.5371480Z op = torch.compile(op) 2025-05-07T20:32:47.5371843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5372128Z 2025-05-07T20:32:47.5372333Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.5372498Z 2025-05-07T20:32:47.5372602Z moe/activation_test.py:117: 2025-05-07T20:32:47.5372908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5373320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.5373605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5374292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.5374989Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.5375565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5376540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5377484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5378238Z kernel = self.compile( 2025-05-07T20:32:47.5379098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5380098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5380703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5381006Z 2025-05-07T20:32:47.5381221Z self = 2025-05-07T20:32:47.5382302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5383676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b181440>} 2025-05-07T20:32:47.5385015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5386050Z context = 2025-05-07T20:32:47.5386344Z 2025-05-07T20:32:47.5386527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5387063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5387538Z module_map=module_map) 2025-05-07T20:32:47.5387924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5388423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.5388697Z E ^ 2025-05-07T20:32:47.5389222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5389677Z 2025-05-07T20:32:47.5390102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5390613Z 2025-05-07T20:32:47.5390734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5391153Z self=, 2025-05-07T20:32:47.5391612Z T=128, 2025-05-07T20:32:47.5391819Z D=7168, 2025-05-07T20:32:47.5392023Z scale_ub=None, 2025-05-07T20:32:47.5392258Z contiguous=False, 2025-05-07T20:32:47.5392509Z compiled=True, 2025-05-07T20:32:47.5392725Z ) 2025-05-07T20:32:47.5949362Z self = 2025-05-07T20:32:47.5950067Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.5950423Z 2025-05-07T20:32:47.5950515Z @given( 2025-05-07T20:32:47.5950965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5951294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5951621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5951956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5952302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5952601Z ) 2025-05-07T20:32:47.5952977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5953429Z def test_silu_mul_quant( 2025-05-07T20:32:47.5953688Z self, 2025-05-07T20:32:47.5953897Z T: int, 2025-05-07T20:32:47.5954102Z D: int, 2025-05-07T20:32:47.5954331Z scale_ub: Optional[float], 2025-05-07T20:32:47.5954612Z contiguous: bool, 2025-05-07T20:32:47.5954852Z compiled: bool, 2025-05-07T20:32:47.5955097Z ) -> None: 2025-05-07T20:32:47.5955327Z torch.manual_seed(2025) 2025-05-07T20:32:47.5955576Z 2025-05-07T20:32:47.5955864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5956219Z 2025-05-07T20:32:47.5956415Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5956716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5957030Z x = x_sign * x_clamp 2025-05-07T20:32:47.5957273Z x0 = x[:, :D] 2025-05-07T20:32:47.5957498Z x1 = x[:, D:] 2025-05-07T20:32:47.5957732Z 2025-05-07T20:32:47.5957922Z if contiguous: 2025-05-07T20:32:47.5958172Z x0 = x0.contiguous() 2025-05-07T20:32:47.5958441Z x1 = x1.contiguous() 2025-05-07T20:32:47.5958688Z 2025-05-07T20:32:47.5958887Z if scale_ub is not None: 2025-05-07T20:32:47.5959165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5959825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5960137Z ) 2025-05-07T20:32:47.5960344Z else: 2025-05-07T20:32:47.5960585Z scale_ub_tensor = None 2025-05-07T20:32:47.5960848Z 2025-05-07T20:32:47.5961081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5961400Z op = silu_mul_quant 2025-05-07T20:32:47.5961656Z if compiled: 2025-05-07T20:32:47.5961909Z op = torch.compile(op) 2025-05-07T20:32:47.5962207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5962491Z 2025-05-07T20:32:47.5962688Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.5962977Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.5963277Z 2025-05-07T20:32:47.5963518Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5963859Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.5964159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.5964563Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.5964924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5965331Z 2025-05-07T20:32:47.5965552Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.5965750Z 2025-05-07T20:32:47.5965853Z moe/activation_test.py:126: 2025-05-07T20:32:47.5966161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5966500Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.5966824Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5967684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.5968439Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.5968988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5969665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5970411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.5971134Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.5971863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.5972496Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.5973160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.5973680Z fn() 2025-05-07T20:32:47.5974184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.5974772Z self.fn.run( 2025-05-07T20:32:47.5975244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5975779Z kernel = self.compile( 2025-05-07T20:32:47.5976319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5976968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5977374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5977603Z 2025-05-07T20:32:47.5977822Z self = 2025-05-07T20:32:47.5978914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5980334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af64540>} 2025-05-07T20:32:47.5981727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5982745Z context = 2025-05-07T20:32:47.5983029Z 2025-05-07T20:32:47.5983194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5983714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5984185Z module_map=module_map) 2025-05-07T20:32:47.5984554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5984907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.5985181Z E ^ 2025-05-07T20:32:47.5985649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5986155Z 2025-05-07T20:32:47.5986617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5987140Z 2025-05-07T20:32:47.5987248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5987672Z self=, 2025-05-07T20:32:47.5988078Z T=128, 2025-05-07T20:32:47.5988263Z D=7168, 2025-05-07T20:32:47.5988469Z scale_ub=None, 2025-05-07T20:32:47.5988738Z contiguous=False, 2025-05-07T20:32:47.5988968Z compiled=False, 2025-05-07T20:32:47.5989185Z ) 2025-05-07T20:32:47.7942635Z self = 2025-05-07T20:32:47.7943367Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7943733Z 2025-05-07T20:32:47.7943844Z @given( 2025-05-07T20:32:47.7944134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7944458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7945054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7945390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7945727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7946016Z ) 2025-05-07T20:32:47.7946365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7946817Z def test_silu_mul_quant( 2025-05-07T20:32:47.7947071Z self, 2025-05-07T20:32:47.7947275Z T: int, 2025-05-07T20:32:47.7947484Z D: int, 2025-05-07T20:32:47.7947712Z scale_ub: Optional[float], 2025-05-07T20:32:47.7947984Z contiguous: bool, 2025-05-07T20:32:47.7948237Z compiled: bool, 2025-05-07T20:32:47.7948475Z ) -> None: 2025-05-07T20:32:47.7948691Z torch.manual_seed(2025) 2025-05-07T20:32:47.7948941Z 2025-05-07T20:32:47.7949219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7949564Z 2025-05-07T20:32:47.7949766Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7950068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7950385Z x = x_sign * x_clamp 2025-05-07T20:32:47.7950633Z x0 = x[:, :D] 2025-05-07T20:32:47.7950856Z x1 = x[:, D:] 2025-05-07T20:32:47.7951072Z 2025-05-07T20:32:47.7951260Z if contiguous: 2025-05-07T20:32:47.7951495Z x0 = x0.contiguous() 2025-05-07T20:32:47.7951762Z x1 = x1.contiguous() 2025-05-07T20:32:47.7952002Z 2025-05-07T20:32:47.7952219Z if scale_ub is not None: 2025-05-07T20:32:47.7952503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7952840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7953160Z ) 2025-05-07T20:32:47.7953365Z else: 2025-05-07T20:32:47.7953586Z scale_ub_tensor = None 2025-05-07T20:32:47.7953857Z 2025-05-07T20:32:47.7954098Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7954415Z op = silu_mul_quant 2025-05-07T20:32:47.7954676Z if compiled: 2025-05-07T20:32:47.7954933Z op = torch.compile(op) 2025-05-07T20:32:47.7955232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7955524Z 2025-05-07T20:32:47.7955731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7955900Z 2025-05-07T20:32:47.7956012Z moe/activation_test.py:117: 2025-05-07T20:32:47.7956325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7956664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7956953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7957642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7958455Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7959070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7960085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7960739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7961272Z kernel = self.compile( 2025-05-07T20:32:47.7961813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7962550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7962950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7963181Z 2025-05-07T20:32:47.7963389Z self = 2025-05-07T20:32:47.7964520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7965891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af66700>} 2025-05-07T20:32:47.7967221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7968241Z context = 2025-05-07T20:32:47.7968557Z 2025-05-07T20:32:47.7968725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7969245Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7969716Z module_map=module_map) 2025-05-07T20:32:47.7970081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7970439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.7970704Z E ^ 2025-05-07T20:32:47.7971167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.7971615Z 2025-05-07T20:32:47.7972027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.7972548Z 2025-05-07T20:32:47.7972651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.7973133Z self=, 2025-05-07T20:32:47.7973529Z T=4096, 2025-05-07T20:32:47.7973724Z D=5120, 2025-05-07T20:32:47.7973925Z scale_ub=1200.0, 2025-05-07T20:32:47.7974147Z contiguous=True, 2025-05-07T20:32:47.7974373Z compiled=False, 2025-05-07T20:32:47.7974591Z ) 2025-05-07T20:32:47.7974909Z self = 2025-05-07T20:32:47.7975414Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.7975688Z 2025-05-07T20:32:47.7975778Z @given( 2025-05-07T20:32:47.7976015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7976325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7976641Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7976974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7977313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7977608Z ) 2025-05-07T20:32:47.7977961Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7978399Z def test_silu_mul_quant( 2025-05-07T20:32:47.7978654Z self, 2025-05-07T20:32:47.7978863Z T: int, 2025-05-07T20:32:47.7979132Z D: int, 2025-05-07T20:32:47.7979359Z scale_ub: Optional[float], 2025-05-07T20:32:47.7979640Z contiguous: bool, 2025-05-07T20:32:47.7979950Z compiled: bool, 2025-05-07T20:32:47.7980180Z ) -> None: 2025-05-07T20:32:47.7980405Z torch.manual_seed(2025) 2025-05-07T20:32:47.7980646Z 2025-05-07T20:32:47.7980925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7981274Z 2025-05-07T20:32:47.7981472Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7981759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7982118Z x = x_sign * x_clamp 2025-05-07T20:32:47.7982365Z x0 = x[:, :D] 2025-05-07T20:32:47.7982578Z x1 = x[:, D:] 2025-05-07T20:32:47.7982794Z 2025-05-07T20:32:47.7982981Z if contiguous: 2025-05-07T20:32:47.7983211Z x0 = x0.contiguous() 2025-05-07T20:32:47.7983476Z x1 = x1.contiguous() 2025-05-07T20:32:47.7983723Z 2025-05-07T20:32:47.7983916Z if scale_ub is not None: 2025-05-07T20:32:47.7984193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7984580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7984888Z ) 2025-05-07T20:32:47.7985089Z else: 2025-05-07T20:32:47.7985304Z scale_ub_tensor = None 2025-05-07T20:32:47.7985556Z 2025-05-07T20:32:47.7985795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7986114Z op = silu_mul_quant 2025-05-07T20:32:47.7986372Z if compiled: 2025-05-07T20:32:47.7986628Z op = torch.compile(op) 2025-05-07T20:32:47.7986929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7987212Z 2025-05-07T20:32:47.7987406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7987578Z 2025-05-07T20:32:47.7987678Z moe/activation_test.py:117: 2025-05-07T20:32:47.7987982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7988312Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7988603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7989294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7989978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7990510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7991193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7991857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7992379Z kernel = self.compile( 2025-05-07T20:32:47.7992924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7993579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7993981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7994213Z 2025-05-07T20:32:47.7994423Z self = 2025-05-07T20:32:47.7995491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7996848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af676a0>} 2025-05-07T20:32:47.7998178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7999245Z context = 2025-05-07T20:32:47.7999533Z 2025-05-07T20:32:47.7999738Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8000264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8000786Z module_map=module_map) 2025-05-07T20:32:47.8001148Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8001512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.8001772Z E ^ 2025-05-07T20:32:47.8002278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8002723Z 2025-05-07T20:32:47.8003135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8003645Z 2025-05-07T20:32:47.8003751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8004175Z self=, 2025-05-07T20:32:47.8004567Z T=1, 2025-05-07T20:32:47.8004797Z D=5120, 2025-05-07T20:32:47.8004998Z scale_ub=None, 2025-05-07T20:32:47.8005210Z contiguous=True, 2025-05-07T20:32:47.8005438Z compiled=True, 2025-05-07T20:32:47.8005648Z ) 2025-05-07T20:32:48.1738602Z self = 2025-05-07T20:32:48.1739318Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.1739701Z 2025-05-07T20:32:48.1739786Z @given( 2025-05-07T20:32:48.1740038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1740360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1740677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1741017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1741359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1741662Z ) 2025-05-07T20:32:48.1742017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1742480Z def test_silu_mul_quant( 2025-05-07T20:32:48.1742727Z self, 2025-05-07T20:32:48.1742923Z T: int, 2025-05-07T20:32:48.1743128Z D: int, 2025-05-07T20:32:48.1743353Z scale_ub: Optional[float], 2025-05-07T20:32:48.1743629Z contiguous: bool, 2025-05-07T20:32:48.1743888Z compiled: bool, 2025-05-07T20:32:48.1744125Z ) -> None: 2025-05-07T20:32:48.1744350Z torch.manual_seed(2025) 2025-05-07T20:32:48.1744607Z 2025-05-07T20:32:48.1744882Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1745225Z 2025-05-07T20:32:48.1745435Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1745733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1746050Z x = x_sign * x_clamp 2025-05-07T20:32:48.1746321Z x0 = x[:, :D] 2025-05-07T20:32:48.1746540Z x1 = x[:, D:] 2025-05-07T20:32:48.1746759Z 2025-05-07T20:32:48.1746957Z if contiguous: 2025-05-07T20:32:48.1747190Z x0 = x0.contiguous() 2025-05-07T20:32:48.1747452Z x1 = x1.contiguous() 2025-05-07T20:32:48.1747700Z 2025-05-07T20:32:48.1747894Z if scale_ub is not None: 2025-05-07T20:32:48.1748177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1748524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1748831Z ) 2025-05-07T20:32:48.1749034Z else: 2025-05-07T20:32:48.1749255Z scale_ub_tensor = None 2025-05-07T20:32:48.1749508Z 2025-05-07T20:32:48.1749752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1750077Z op = silu_mul_quant 2025-05-07T20:32:48.1750333Z if compiled: 2025-05-07T20:32:48.1750594Z op = torch.compile(op) 2025-05-07T20:32:48.1751184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1751469Z 2025-05-07T20:32:48.1751665Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.1752057Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.1752361Z 2025-05-07T20:32:48.1752596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1752942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.1753242Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.1753554Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.1754013Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.1754349Z 2025-05-07T20:32:48.1761773Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.1761999Z 2025-05-07T20:32:48.1762109Z moe/activation_test.py:126: 2025-05-07T20:32:48.1762423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1762771Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.1763106Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.1764056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.1764814Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.1765350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.1766029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.1766720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.1767437Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.1768152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.1768788Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.1769394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.1769903Z fn() 2025-05-07T20:32:48.1770410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.1770992Z self.fn.run( 2025-05-07T20:32:48.1771458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.1771979Z kernel = self.compile( 2025-05-07T20:32:48.1772517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.1773231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.1773619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1773854Z 2025-05-07T20:32:48.1774060Z self = 2025-05-07T20:32:48.1775134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1776498Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd43880>} 2025-05-07T20:32:48.1777823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1778825Z context = 2025-05-07T20:32:48.1779117Z 2025-05-07T20:32:48.1779282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1779946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1780413Z module_map=module_map) 2025-05-07T20:32:48.1780772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1781128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.1781400Z E ^ 2025-05-07T20:32:48.1781859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1782382Z 2025-05-07T20:32:48.1782792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1783303Z 2025-05-07T20:32:48.1783406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1783821Z self=, 2025-05-07T20:32:48.1784209Z T=2048, 2025-05-07T20:32:48.1784403Z D=5120, 2025-05-07T20:32:48.1784603Z scale_ub=None, 2025-05-07T20:32:48.1784816Z contiguous=True, 2025-05-07T20:32:48.1785044Z compiled=True, 2025-05-07T20:32:48.1785301Z ) 2025-05-07T20:32:48.5388958Z self = 2025-05-07T20:32:48.5389701Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.5390037Z 2025-05-07T20:32:48.5390121Z @given( 2025-05-07T20:32:48.5390380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.5390728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.5391044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.5391397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.5391744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.5392036Z ) 2025-05-07T20:32:48.5392395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.5392865Z def test_silu_mul_quant( 2025-05-07T20:32:48.5393120Z self, 2025-05-07T20:32:48.5393325Z T: int, 2025-05-07T20:32:48.5393550Z D: int, 2025-05-07T20:32:48.5393792Z scale_ub: Optional[float], 2025-05-07T20:32:48.5394075Z contiguous: bool, 2025-05-07T20:32:48.5394337Z compiled: bool, 2025-05-07T20:32:48.5394593Z ) -> None: 2025-05-07T20:32:48.5394823Z torch.manual_seed(2025) 2025-05-07T20:32:48.5395087Z 2025-05-07T20:32:48.5395376Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.5395731Z 2025-05-07T20:32:48.5395940Z x_sign = torch.sign(x) 2025-05-07T20:32:48.5396254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.5396569Z x = x_sign * x_clamp 2025-05-07T20:32:48.5396827Z x0 = x[:, :D] 2025-05-07T20:32:48.5397057Z x1 = x[:, D:] 2025-05-07T20:32:48.5397269Z 2025-05-07T20:32:48.5397479Z if contiguous: 2025-05-07T20:32:48.5397732Z x0 = x0.contiguous() 2025-05-07T20:32:48.5398014Z x1 = x1.contiguous() 2025-05-07T20:32:48.5398362Z 2025-05-07T20:32:48.5398644Z if scale_ub is not None: 2025-05-07T20:32:48.5399035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.5399502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.5399919Z ) 2025-05-07T20:32:48.5400182Z else: 2025-05-07T20:32:48.5400469Z scale_ub_tensor = None 2025-05-07T20:32:48.5400857Z 2025-05-07T20:32:48.5401221Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.5401746Z op = silu_mul_quant 2025-05-07T20:32:48.5402124Z if compiled: 2025-05-07T20:32:48.5402474Z op = torch.compile(op) 2025-05-07T20:32:48.5402889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.5403279Z 2025-05-07T20:32:48.5403494Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.5403973Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.5404269Z 2025-05-07T20:32:48.5404607Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.5404957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.5405257Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.5405574Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.5405941Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.5406265Z 2025-05-07T20:32:48.5406469Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.5406754Z 2025-05-07T20:32:48.5406861Z moe/activation_test.py:126: 2025-05-07T20:32:48.5407169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.5407509Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.5407842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.5408639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.5409483Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.5410027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.5410740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.5411452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.5412187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.5412917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.5413701Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.5414306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.5414831Z fn() 2025-05-07T20:32:48.5415348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.5415934Z self.fn.run( 2025-05-07T20:32:48.5416405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.5416929Z kernel = self.compile( 2025-05-07T20:32:48.5417473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.5418137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.5418534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.5418769Z 2025-05-07T20:32:48.5418981Z self = 2025-05-07T20:32:48.5420068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.5421441Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b6f3ba0>} 2025-05-07T20:32:48.5422770Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.5423784Z context = 2025-05-07T20:32:48.5424082Z 2025-05-07T20:32:48.5424250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.5424771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.5425300Z module_map=module_map) 2025-05-07T20:32:48.5425705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.5426073Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.5426353Z E ^ 2025-05-07T20:32:48.5426812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.5427263Z 2025-05-07T20:32:48.5427675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.5428231Z 2025-05-07T20:32:48.5428339Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.5428761Z self=, 2025-05-07T20:32:48.5429158Z T=128, 2025-05-07T20:32:48.5429364Z D=5120, 2025-05-07T20:32:48.5429574Z scale_ub=None, 2025-05-07T20:32:48.5429793Z contiguous=True, 2025-05-07T20:32:48.5430031Z compiled=True, 2025-05-07T20:32:48.5430246Z ) 2025-05-07T20:32:48.9620790Z self = 2025-05-07T20:32:48.9621835Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.9622122Z 2025-05-07T20:32:48.9622208Z @given( 2025-05-07T20:32:48.9622451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9622779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9623086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9623434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9623777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9624070Z ) 2025-05-07T20:32:48.9624428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9624876Z def test_silu_mul_quant( 2025-05-07T20:32:48.9625123Z self, 2025-05-07T20:32:48.9625316Z T: int, 2025-05-07T20:32:48.9625536Z D: int, 2025-05-07T20:32:48.9625765Z scale_ub: Optional[float], 2025-05-07T20:32:48.9626043Z contiguous: bool, 2025-05-07T20:32:48.9626303Z compiled: bool, 2025-05-07T20:32:48.9626538Z ) -> None: 2025-05-07T20:32:48.9626753Z torch.manual_seed(2025) 2025-05-07T20:32:48.9626999Z 2025-05-07T20:32:48.9627281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9627627Z 2025-05-07T20:32:48.9627828Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9628136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9628445Z x = x_sign * x_clamp 2025-05-07T20:32:48.9628692Z x0 = x[:, :D] 2025-05-07T20:32:48.9628913Z x1 = x[:, D:] 2025-05-07T20:32:48.9629126Z 2025-05-07T20:32:48.9629311Z if contiguous: 2025-05-07T20:32:48.9629552Z x0 = x0.contiguous() 2025-05-07T20:32:48.9629814Z x1 = x1.contiguous() 2025-05-07T20:32:48.9630052Z 2025-05-07T20:32:48.9630244Z if scale_ub is not None: 2025-05-07T20:32:48.9630525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9630890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9631231Z ) 2025-05-07T20:32:48.9631431Z else: 2025-05-07T20:32:48.9631641Z scale_ub_tensor = None 2025-05-07T20:32:48.9631894Z 2025-05-07T20:32:48.9632131Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9632446Z op = silu_mul_quant 2025-05-07T20:32:48.9632710Z if compiled: 2025-05-07T20:32:48.9632985Z op = torch.compile(op) 2025-05-07T20:32:48.9633287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9633572Z 2025-05-07T20:32:48.9633774Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.9634059Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.9634361Z 2025-05-07T20:32:48.9634607Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9635035Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.9635458Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.9635781Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.9636140Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.9636445Z 2025-05-07T20:32:48.9636652Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.9636845Z 2025-05-07T20:32:48.9636959Z moe/activation_test.py:126: 2025-05-07T20:32:48.9637256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9637674Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.9638005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.9638782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.9639531Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.9640076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9640796Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9641524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.9642241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.9642968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.9643604Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.9644195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.9644714Z fn() 2025-05-07T20:32:48.9645218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.9645788Z self.fn.run( 2025-05-07T20:32:48.9646265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9646797Z kernel = self.compile( 2025-05-07T20:32:48.9647339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9647979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9648378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9648603Z 2025-05-07T20:32:48.9648818Z self = 2025-05-07T20:32:48.9649893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9651318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a4914e0>} 2025-05-07T20:32:48.9652644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9653798Z context = 2025-05-07T20:32:48.9654087Z 2025-05-07T20:32:48.9654260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9654774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9655247Z module_map=module_map) 2025-05-07T20:32:48.9655618Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9656035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.9656302Z E ^ 2025-05-07T20:32:48.9656814Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.5427263Z 2025-05-07T20:32:48.5427675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.5428231Z 2025-05-07T20:32:48.5428339Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.3884811Z self = 2025-05-07T20:32:49.3885569Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.3900509Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.3900818Z moe/activation_test.py:126: 2025-05-07T20:32:49.3901480Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.3901961Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.3919252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3919610Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.3919873Z E ^ 2025-05-07T20:32:49.3920394Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3921300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
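Every example above and below fails the same way: Triton refuses to lower the kernel because fp8e4nv (Triton's name for torch.float8_e4m3fn) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer, while older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a guard the test could check before exercising fp8 kernels (the helper name is hypothetical, not part of the test file):

    import torch

    # Hypothetical helper: Triton only lowers fp8e4nv (torch.float8_e4m3fn)
    # on NVIDIA GPUs with compute capability >= 8.9, so probe the device
    # before exercising fp8 kernels.
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)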
2025-05-07T20:32:49.3921905Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.4183343Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:49.4184580Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:49.4186151Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:49.4187121Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:49.4188205Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:49.5071203Z self = 2025-05-07T20:32:49.5071748Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.5093686Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.5094076Z moe/activation_test.py:126: 2025-05-07T20:32:49.5094709Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.5095026Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5112513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.5112877Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.5113154Z E ^ 2025-05-07T20:32:49.5113622Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5114488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:49.5115153Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.6532855Z self = 2025-05-07T20:32:49.6533452Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.6545918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.6546197Z moe/activation_test.py:117: 2025-05-07T20:32:49.6546913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.6547202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.6547763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.6548322Z return fn(*args, **kwargs) 2025-05-07T20:32:49.6548979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.6549672Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.6561125Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.6561501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.6561776Z E ^ 2025-05-07T20:32:49.6562249Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.6563111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
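The recompile_limit warning above is a side effect of the property-based loop rather than a separate bug: each (shape, stride) combination drawn by @given trips a new dynamo guard (the last reason is a stride mismatch between the contiguous and sliced variants of x0), and after 8 recompiles dynamo falls back to eager. A hedged sketch of one way to compile once across layouts, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    # Assumed import, taken from the traceback path
    # .../fbgemm_gpu/experimental/gen_ai/moe/activation.py
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # dynamic=True asks torch.compile to keep sizes and strides symbolic
    # instead of specializing (and recompiling) on every new input layout
    # drawn by the property-based test.
    op = torch.compile(silu_mul_quant, dynamic=True)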
2025-05-07T20:32:49.6563737Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, ) 2025-05-07T20:32:49.8720448Z self = 2025-05-07T20:32:49.8721355Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.8736045Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.8736342Z moe/activation_test.py:126: 2025-05-07T20:32:49.8736980Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.8737310Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.8754908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8755273Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.8755551Z E ^ 2025-05-07T20:32:49.8756013Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8756873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
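For reference, the quantization step ref_fn delegates to, triton_quantize_fp8_row, performs row-wise fp8 quantization. Below is an assumed plain-PyTorch equivalent consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); it is an illustration of the semantics, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    # Hedged sketch: scale each row so its max |value| (optionally clamped
    # to scale_ub) maps onto the float8_e4m3fn range; the returned scale is
    # the per-row dequantization factor.
    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale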
2025-05-07T20:32:49.8757487Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, ) 2025-05-07T20:32:50.0275177Z self = 2025-05-07T20:32:50.0276146Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.0288212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0288474Z moe/activation_test.py:117: 2025-05-07T20:32:50.0289101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0289391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0290073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0290756Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0300670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0302448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0302714Z E ^ 2025-05-07T20:32:50.0303180Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0304051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
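Both call paths die in kernel compilation (the compiled and eager fn() paths in _fbgemm_silu_mul_quant, the ref_fn() path in _kernel_quantize_fp8_row), so on this GPU the test can never reach its assertions. A hedged sketch of a suite-level guard (marker name hypothetical) that would skip it up front instead of failing once per drawn example:

    import pytest
    import torch

    # Hypothetical marker: skip fp8 tests outright where Triton cannot
    # compile fp8e4nv, rather than failing on every hypothesis example.
    requires_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )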
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0303628Z 2025-05-07T20:32:50.0304051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0304554Z 2025-05-07T20:32:50.0304669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0305077Z self=, 2025-05-07T20:32:50.0305482Z T=128, 2025-05-07T20:32:50.0305677Z D=5120, 2025-05-07T20:32:50.0305882Z scale_ub=None, 2025-05-07T20:32:50.0306105Z contiguous=False, 2025-05-07T20:32:50.0306341Z compiled=True, 2025-05-07T20:32:50.0306545Z ) 2025-05-07T20:32:50.0306868Z self = 2025-05-07T20:32:50.0307365Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.0307633Z 2025-05-07T20:32:50.0307716Z @given( 2025-05-07T20:32:50.0307956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0308275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0308591Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0308914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0309240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0309524Z ) 2025-05-07T20:32:50.0309869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0310309Z def test_silu_mul_quant( 2025-05-07T20:32:50.0310561Z self, 2025-05-07T20:32:50.0310759Z T: int, 2025-05-07T20:32:50.0310961Z D: int, 2025-05-07T20:32:50.0311199Z scale_ub: Optional[float], 2025-05-07T20:32:50.0311505Z contiguous: bool, 2025-05-07T20:32:50.0311753Z compiled: bool, 2025-05-07T20:32:50.0311979Z ) -> None: 2025-05-07T20:32:50.0312190Z torch.manual_seed(2025) 2025-05-07T20:32:50.0312439Z 2025-05-07T20:32:50.0312716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0313054Z 2025-05-07T20:32:50.0313244Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0313537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0313852Z x = x_sign * x_clamp 2025-05-07T20:32:50.0314094Z x0 = x[:, :D] 2025-05-07T20:32:50.0314372Z x1 = x[:, D:] 2025-05-07T20:32:50.0314584Z 2025-05-07T20:32:50.0314776Z if contiguous: 2025-05-07T20:32:50.0315014Z x0 = x0.contiguous() 2025-05-07T20:32:50.0315353Z x1 = x1.contiguous() 2025-05-07T20:32:50.0315596Z 2025-05-07T20:32:50.0315796Z if scale_ub is not None: 2025-05-07T20:32:50.0316074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0316403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0323982Z ) 2025-05-07T20:32:50.0324203Z else: 2025-05-07T20:32:50.0324416Z scale_ub_tensor = None 2025-05-07T20:32:50.0324768Z 2025-05-07T20:32:50.0325013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0325330Z op = silu_mul_quant 2025-05-07T20:32:50.0325590Z if compiled: 2025-05-07T20:32:50.0325848Z op = torch.compile(op) 2025-05-07T20:32:50.0326142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0326426Z 2025-05-07T20:32:50.0326627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0326796Z 2025-05-07T20:32:50.0326906Z moe/activation_test.py:117: 2025-05-07T20:32:50.0327254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0327591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0327879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0328427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.0328983Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.0329640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0330319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0330843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0331515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0332184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0332703Z kernel = self.compile( 2025-05-07T20:32:50.0333337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0333989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0334387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0334615Z 2025-05-07T20:32:50.0334820Z self = 2025-05-07T20:32:50.0335889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0337249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a9db1a0>} 2025-05-07T20:32:50.0338570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0339576Z context = 2025-05-07T20:32:50.0339858Z 2025-05-07T20:32:50.0340026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0340543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0341010Z module_map=module_map) 2025-05-07T20:32:50.0341415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0341773Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0342085Z E ^ 2025-05-07T20:32:50.0342587Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0343029Z 2025-05-07T20:32:50.0343440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0343960Z 2025-05-07T20:32:50.0344065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0344476Z self=, 2025-05-07T20:32:50.0344906Z T=128, 2025-05-07T20:32:50.0345099Z D=7168, 2025-05-07T20:32:50.0345296Z scale_ub=1200.0, 2025-05-07T20:32:50.0345531Z contiguous=False, 2025-05-07T20:32:50.0345753Z compiled=False, 2025-05-07T20:32:50.0345964Z ) 2025-05-07T20:32:50.1487762Z self = 2025-05-07T20:32:50.1488425Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.1488773Z 2025-05-07T20:32:50.1488856Z @given( 2025-05-07T20:32:50.1489107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1489729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1490051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1490391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1490714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1491005Z ) 2025-05-07T20:32:50.1491370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1491871Z def test_silu_mul_quant( 2025-05-07T20:32:50.1492120Z self, 2025-05-07T20:32:50.1492322Z T: int, 2025-05-07T20:32:50.1492532Z D: int, 2025-05-07T20:32:50.1492752Z scale_ub: Optional[float], 2025-05-07T20:32:50.1493106Z contiguous: bool, 2025-05-07T20:32:50.1493351Z compiled: bool, 2025-05-07T20:32:50.1493588Z ) -> None: 2025-05-07T20:32:50.1493809Z torch.manual_seed(2025) 2025-05-07T20:32:50.1494057Z 2025-05-07T20:32:50.1494340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1494694Z 2025-05-07T20:32:50.1494902Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1495195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1495511Z x = x_sign * x_clamp 2025-05-07T20:32:50.1495764Z x0 = x[:, :D] 2025-05-07T20:32:50.1495980Z x1 = x[:, D:] 2025-05-07T20:32:50.1496202Z 2025-05-07T20:32:50.1496401Z if contiguous: 2025-05-07T20:32:50.1496638Z x0 = x0.contiguous() 2025-05-07T20:32:50.1496911Z x1 = x1.contiguous() 2025-05-07T20:32:50.1497160Z 2025-05-07T20:32:50.1497357Z if scale_ub is not None: 2025-05-07T20:32:50.1497643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1497986Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1498306Z ) 2025-05-07T20:32:50.1498501Z else: 2025-05-07T20:32:50.1498722Z scale_ub_tensor = None 2025-05-07T20:32:50.1498995Z 2025-05-07T20:32:50.1499234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1499595Z op = silu_mul_quant 2025-05-07T20:32:50.1499845Z if compiled: 2025-05-07T20:32:50.1500102Z op = torch.compile(op) 2025-05-07T20:32:50.1500402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1500680Z 2025-05-07T20:32:50.1500883Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1501050Z 2025-05-07T20:32:50.1501158Z moe/activation_test.py:117: 2025-05-07T20:32:50.1501453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1501791Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1502080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1502765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1503547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1504168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1504850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1505507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1506037Z kernel = self.compile( 2025-05-07T20:32:50.1506657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1507311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1507712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1507947Z 2025-05-07T20:32:50.1508158Z self = 2025-05-07T20:32:50.1509285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1510653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b3c80e0>} 2025-05-07T20:32:50.1512237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1513510Z context = 2025-05-07T20:32:50.1513813Z 2025-05-07T20:32:50.1513983Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1514508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1514978Z module_map=module_map) 2025-05-07T20:32:50.1515357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1515716Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1515979Z E ^ 2025-05-07T20:32:50.1516449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1516901Z 2025-05-07T20:32:50.1517320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1517827Z 2025-05-07T20:32:50.1517940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1518348Z self=, 2025-05-07T20:32:50.1518759Z T=128, 2025-05-07T20:32:50.1518956Z D=5120, 2025-05-07T20:32:50.1519161Z scale_ub=None, 2025-05-07T20:32:50.1519375Z contiguous=False, 2025-05-07T20:32:50.1519616Z compiled=False, 2025-05-07T20:32:50.1519844Z ) 2025-05-07T20:32:50.1520171Z self = 2025-05-07T20:32:50.1520672Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.1521005Z 2025-05-07T20:32:50.1521115Z @given( 2025-05-07T20:32:50.1521403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1521804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1522203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1522613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1523006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1523302Z ) 2025-05-07T20:32:50.1523660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1524148Z def test_silu_mul_quant( 2025-05-07T20:32:50.1524397Z self, 2025-05-07T20:32:50.1524602Z T: int, 2025-05-07T20:32:50.1524799Z D: int, 2025-05-07T20:32:50.1525069Z scale_ub: Optional[float], 2025-05-07T20:32:50.1525356Z contiguous: bool, 2025-05-07T20:32:50.1525599Z compiled: bool, 2025-05-07T20:32:50.1525830Z ) -> None: 2025-05-07T20:32:50.1526054Z torch.manual_seed(2025) 2025-05-07T20:32:50.1526293Z 2025-05-07T20:32:50.1526578Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1526922Z 2025-05-07T20:32:50.1527162Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1527455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1527770Z x = x_sign * x_clamp 2025-05-07T20:32:50.1528013Z x0 = x[:, :D] 2025-05-07T20:32:50.1528239Z x1 = x[:, D:] 2025-05-07T20:32:50.1528458Z 2025-05-07T20:32:50.1528651Z if contiguous: 2025-05-07T20:32:50.1528895Z x0 = x0.contiguous() 2025-05-07T20:32:50.1529163Z x1 = x1.contiguous() 2025-05-07T20:32:50.1529402Z 2025-05-07T20:32:50.1529605Z if scale_ub is not None: 2025-05-07T20:32:50.1529935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1530269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1530587Z ) 2025-05-07T20:32:50.1530791Z else: 2025-05-07T20:32:50.1531013Z scale_ub_tensor = None 2025-05-07T20:32:50.1531264Z 2025-05-07T20:32:50.1531498Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1531817Z op = silu_mul_quant 2025-05-07T20:32:50.1532064Z if compiled: 2025-05-07T20:32:50.1532321Z op = torch.compile(op) 2025-05-07T20:32:50.1532620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1532892Z 2025-05-07T20:32:50.1533154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1533321Z 2025-05-07T20:32:50.1533429Z moe/activation_test.py:117: 2025-05-07T20:32:50.1533721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1534058Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1534344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1535032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1535711Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1536265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1536949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1537606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1538142Z kernel = self.compile( 2025-05-07T20:32:50.1538690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1539338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1539745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1539980Z 2025-05-07T20:32:50.1540187Z self = 2025-05-07T20:32:50.1541284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1542665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b3bea20>} 2025-05-07T20:32:50.1543981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1545126Z context = 2025-05-07T20:32:50.1545421Z 2025-05-07T20:32:50.1545591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1546111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1546572Z module_map=module_map) 2025-05-07T20:32:50.1546941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1547341Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1547601Z E ^ 2025-05-07T20:32:50.1548066Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1548518Z 2025-05-07T20:32:50.1548930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1549437Z 2025-05-07T20:32:50.1549548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1550002Z self=, 2025-05-07T20:32:50.1550407Z T=128, 2025-05-07T20:32:50.1550602Z D=5120, 2025-05-07T20:32:50.1550796Z scale_ub=1200.0, 2025-05-07T20:32:50.1551022Z contiguous=True, 2025-05-07T20:32:50.1551262Z compiled=False, 2025-05-07T20:32:50.1551501Z ) 2025-05-07T20:32:50.3288612Z self = 2025-05-07T20:32:50.3290075Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.3290830Z 2025-05-07T20:32:50.3291053Z @given( 2025-05-07T20:32:50.3291534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.3291893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.3292207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.3292545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.3292875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.3293251Z ) 2025-05-07T20:32:50.3293606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.3294047Z def test_silu_mul_quant( 2025-05-07T20:32:50.3294288Z self, 2025-05-07T20:32:50.3294495Z T: int, 2025-05-07T20:32:50.3294698Z D: int, 2025-05-07T20:32:50.3294916Z scale_ub: Optional[float], 2025-05-07T20:32:50.3295203Z contiguous: bool, 2025-05-07T20:32:50.3295453Z compiled: bool, 2025-05-07T20:32:50.3295687Z ) -> None: 2025-05-07T20:32:50.3295907Z torch.manual_seed(2025) 2025-05-07T20:32:50.3296156Z 2025-05-07T20:32:50.3296441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3296782Z 2025-05-07T20:32:50.3296987Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3297287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3297602Z x = x_sign * x_clamp 2025-05-07T20:32:50.3297860Z x0 = x[:, :D] 2025-05-07T20:32:50.3298094Z x1 = x[:, D:] 2025-05-07T20:32:50.3298302Z 2025-05-07T20:32:50.3298507Z if contiguous: 2025-05-07T20:32:50.3298752Z x0 = x0.contiguous() 2025-05-07T20:32:50.3299015Z x1 = x1.contiguous() 2025-05-07T20:32:50.3299270Z 2025-05-07T20:32:50.3299473Z if scale_ub is not None: 2025-05-07T20:32:50.3299751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3300096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3300410Z ) 2025-05-07T20:32:50.3300609Z else: 2025-05-07T20:32:50.3300829Z scale_ub_tensor = None 2025-05-07T20:32:50.3301092Z 2025-05-07T20:32:50.3301333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3301644Z op = silu_mul_quant 2025-05-07T20:32:50.3302182Z if compiled: 2025-05-07T20:32:50.3302440Z op = torch.compile(op) 2025-05-07T20:32:50.3302814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3303092Z 2025-05-07T20:32:50.3303300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3303466Z 2025-05-07T20:32:50.3303565Z moe/activation_test.py:117: 2025-05-07T20:32:50.3303871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3304205Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3304485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3305252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.3305937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.3306474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.3307149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.3307873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.3308409Z kernel = self.compile( 2025-05-07T20:32:50.3308952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.3309600Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.3309997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3310227Z 2025-05-07T20:32:50.3310440Z self = 2025-05-07T20:32:50.3311552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.3312937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79fb39120>} 2025-05-07T20:32:50.3314262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.3315277Z context = 2025-05-07T20:32:50.3315563Z 2025-05-07T20:32:50.3315740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.3316257Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.3316732Z module_map=module_map) 2025-05-07T20:32:50.3317102Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.3317456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.3317732Z E ^ 2025-05-07T20:32:50.3318207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.3318652Z 2025-05-07T20:32:50.3319071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.3319574Z 2025-05-07T20:32:50.3319680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.3320098Z self=, 2025-05-07T20:32:50.3320512Z T=1, 2025-05-07T20:32:50.3320711Z D=7168, 2025-05-07T20:32:50.3320905Z scale_ub=1200.0, 2025-05-07T20:32:50.3321135Z contiguous=True, 2025-05-07T20:32:50.3321369Z compiled=True, 2025-05-07T20:32:50.3321579Z ) 2025-05-07T20:32:50.3321906Z self = 2025-05-07T20:32:50.3322398Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.3322709Z 2025-05-07T20:32:50.3322791Z @given( 2025-05-07T20:32:50.3323075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.3323395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.3323706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.3324041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.3324374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.3324665Z ) 2025-05-07T20:32:50.3325010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.3325497Z def test_silu_mul_quant( 2025-05-07T20:32:50.3325744Z self, 2025-05-07T20:32:50.3325939Z T: int, 2025-05-07T20:32:50.3326145Z D: int, 2025-05-07T20:32:50.3326366Z scale_ub: Optional[float], 2025-05-07T20:32:50.3326640Z contiguous: bool, 2025-05-07T20:32:50.3326889Z compiled: bool, 2025-05-07T20:32:50.3327127Z ) -> None: 2025-05-07T20:32:50.3327344Z torch.manual_seed(2025) 2025-05-07T20:32:50.3327594Z 2025-05-07T20:32:50.3327915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3328255Z 2025-05-07T20:32:50.3328458Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3328756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3329075Z x = x_sign * x_clamp 2025-05-07T20:32:50.3329330Z x0 = x[:, :D] 2025-05-07T20:32:50.3329556Z x1 = x[:, D:] 2025-05-07T20:32:50.3329779Z 2025-05-07T20:32:50.3329982Z if contiguous: 2025-05-07T20:32:50.3330222Z x0 = x0.contiguous() 2025-05-07T20:32:50.3330491Z x1 = x1.contiguous() 2025-05-07T20:32:50.3330739Z 2025-05-07T20:32:50.3330946Z if scale_ub is not None: 2025-05-07T20:32:50.3331236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3331572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3331893Z ) 2025-05-07T20:32:50.3332105Z else: 2025-05-07T20:32:50.3332324Z scale_ub_tensor = None 2025-05-07T20:32:50.3332591Z 2025-05-07T20:32:50.3332841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3333242Z op = silu_mul_quant 2025-05-07T20:32:50.3333507Z if compiled: 2025-05-07T20:32:50.3333769Z op = torch.compile(op) 2025-05-07T20:32:50.3334064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3334348Z 2025-05-07T20:32:50.3334565Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3334732Z 2025-05-07T20:32:50.3334841Z moe/activation_test.py:117: 2025-05-07T20:32:50.3335139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3335474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3335767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3336325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.3336894Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.3337569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:50.3338258Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:50.3338798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.3339480Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.3340152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.3340678Z     kernel = self.compile(
2025-05-07T20:32:50.3341237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.3341930Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.3342380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:50.3342645Z 
2025-05-07T20:32:50.3342857Z self = 
2025-05-07T20:32:50.3343926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.3345279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:32:50.3346645Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.3347657Z context = 
2025-05-07T20:32:50.3347952Z 
2025-05-07T20:32:50.3348199Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.3348724Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.3349197Z                            module_map=module_map)
2025-05-07T20:32:50.3349563Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.3349926Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:50.3350199Z E   ^
2025-05-07T20:32:50.3350668Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.3351123Z 
2025-05-07T20:32:50.3351571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
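
The root cause is environmental rather than a bug in the kernel: Triton's fp8e4nv type is the FP8 E4M3 format these kernels emit, and Triton only lowers it natively on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G reports SM 8.6, so every kernel touching fp8e4nv fails at compile time with the ValueError above. A capability probe of roughly this shape (a sketch; the helper name is ours, not FBGEMM's) is the usual way to gate such tests:

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Triton rejects fp8e4nv (FP8 E4M3) below SM 8.9; the A10G on this
        # runner is SM 8.6, which is exactly the failure in this log.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)
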
[hypothesis retries below hit the identical CompilationError; the duplicated test source and tracebacks are elided, keeping each tried example and its error]

2025-05-07T20:32:50.3352212Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.4708278Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.4709761Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets past fn(); the same ValueError then surfaces in the reference path via triton_quantize_fp8_row:

2025-05-07T20:32:50.5599045Z         y_fp8, y_scale = fn()
2025-05-07T20:32:50.5599338Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:50.5599637Z 
2025-05-07T20:32:50.5599891Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.5600227Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:50.5600533Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:50.5600948Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:50.5601386Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.5601712Z 
2025-05-07T20:32:50.5601926Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:50.5602121Z 
2025-05-07T20:32:50.5602224Z moe/activation_test.py:126: 
2025-05-07T20:32:50.5602532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:50.5602887Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:50.5603219Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.5604083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:50.5604837Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:50.5605378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.5606061Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.5606794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:50.5607512Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:50.5608232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:50.5608869Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:50.5609477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:50.5609997Z     fn()
2025-05-07T20:32:50.5610501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:50.5611083Z     self.fn.run(
2025-05-07T20:32:50.5611560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.5612085Z     kernel = self.compile(
2025-05-07T20:32:50.5612629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.5613377Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.5620798Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.5621164Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.5621468Z E   ^
2025-05-07T20:32:50.5621956Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.5622928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
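
Both kernels fail for the same reason: triton_quantize_fp8_row also emits fp8e4nv stores. The row-wise quantization the test expects can be approximated in plain PyTorch on any device; this sketch infers the contract from how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]) and is not FBGEMM's implementation:

    import torch

    def rowwise_quantize_fp8_reference(y, scale_ub=None):
        # Per-row max-abs scaling onto the FP8 E4M3 range, optionally clamped
        # by scale_ub (a 1-element float32 tensor), mirroring the test's usage.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # per-row dequant scale
        y_fp8 = (y.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale
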
2025-05-07T20:32:50.5623557Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.7203433Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
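
Because the failure is architectural, hypothesis keeps trying and shrinking examples, and every draw dies in the same compile step, which is why the log repeats below. Skipping up front keeps the signal clean; a guard along these lines (our naming, not the test file's) would do it:

    import unittest
    import torch

    def _compute_capability():
        # (0, 0) on CPU-only hosts so the same guard covers missing GPUs.
        return torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)

    # Applied to test_silu_mul_quant (or its TestCase), this turns the
    # repeated CompilationErrors below into a single skip.
    skip_unless_fp8e4nv = unittest.skipIf(
        _compute_capability() < (8, 9),
        "Triton fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada/Hopper)",
    )
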
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7203884Z 2025-05-07T20:32:50.7204293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7204799Z 2025-05-07T20:32:50.7204902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7205310Z self=, 2025-05-07T20:32:50.7205703Z T=1, 2025-05-07T20:32:50.7205891Z D=5120, 2025-05-07T20:32:50.7206089Z scale_ub=1200.0, 2025-05-07T20:32:50.7206313Z contiguous=False, 2025-05-07T20:32:50.7206543Z compiled=False, 2025-05-07T20:32:50.7206754Z ) 2025-05-07T20:32:50.7207077Z self = 2025-05-07T20:32:50.7207553Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.7207821Z 2025-05-07T20:32:50.7207898Z @given( 2025-05-07T20:32:50.7208128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7208439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7208753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7209085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7209410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7209698Z ) 2025-05-07T20:32:50.7210052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7210545Z def test_silu_mul_quant( 2025-05-07T20:32:50.7210827Z self, 2025-05-07T20:32:50.7211034Z T: int, 2025-05-07T20:32:50.7211235Z D: int, 2025-05-07T20:32:50.7211455Z scale_ub: Optional[float], 2025-05-07T20:32:50.7211733Z contiguous: bool, 2025-05-07T20:32:50.7211976Z compiled: bool, 2025-05-07T20:32:50.7212196Z ) -> None: 2025-05-07T20:32:50.7212423Z torch.manual_seed(2025) 2025-05-07T20:32:50.7212671Z 2025-05-07T20:32:50.7213115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7213463Z 2025-05-07T20:32:50.7213661Z x_sign = torch.sign(x) 2025-05-07T20:32:50.7213952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.7214268Z x = x_sign * x_clamp 2025-05-07T20:32:50.7214517Z x0 = x[:, :D] 2025-05-07T20:32:50.7214728Z x1 = x[:, D:] 2025-05-07T20:32:50.7214942Z 2025-05-07T20:32:50.7215134Z if contiguous: 2025-05-07T20:32:50.7215365Z x0 = x0.contiguous() 2025-05-07T20:32:50.7215679Z x1 = x1.contiguous() 2025-05-07T20:32:50.7215926Z 2025-05-07T20:32:50.7216122Z if scale_ub is not None: 2025-05-07T20:32:50.7216394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.7216736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.7217045Z ) 2025-05-07T20:32:50.7217242Z else: 2025-05-07T20:32:50.7217462Z scale_ub_tensor = None 2025-05-07T20:32:50.7217725Z 2025-05-07T20:32:50.7217957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.7218277Z op = silu_mul_quant 2025-05-07T20:32:50.7218533Z if compiled: 2025-05-07T20:32:50.7218785Z op = torch.compile(op) 2025-05-07T20:32:50.7219092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.7219374Z 2025-05-07T20:32:50.7219572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.7219748Z 2025-05-07T20:32:50.7219849Z moe/activation_test.py:117: 2025-05-07T20:32:50.7220155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.7220491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.7220777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.7221470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.7222159Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.7222700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.7223383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.7224057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.7224599Z kernel = self.compile( 2025-05-07T20:32:50.7225137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.7225796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.7226200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.7226425Z 2025-05-07T20:32:50.7226640Z self = 2025-05-07T20:32:50.7227698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.7229050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f8ceac0>} 2025-05-07T20:32:50.7230463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.7231509Z context = 2025-05-07T20:32:50.7231818Z 2025-05-07T20:32:50.7231981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.7232504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.7233021Z module_map=module_map) 2025-05-07T20:32:50.7233388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.7233743Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.7234005Z E ^ 2025-05-07T20:32:50.7234472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7234919Z 2025-05-07T20:32:50.7235332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7235848Z 2025-05-07T20:32:50.7235998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7236421Z self=, 2025-05-07T20:32:50.7236817Z T=16384, 2025-05-07T20:32:50.7237008Z D=5120, 2025-05-07T20:32:50.7237207Z scale_ub=1200.0, 2025-05-07T20:32:50.7237435Z contiguous=False, 2025-05-07T20:32:50.7237656Z compiled=True, 2025-05-07T20:32:50.7237863Z ) 2025-05-07T20:32:50.8114303Z self = 2025-05-07T20:32:50.8115056Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.8115432Z 2025-05-07T20:32:50.8115543Z @given( 2025-05-07T20:32:50.8115841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8116273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8116670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8117071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8117402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8117693Z ) 2025-05-07T20:32:50.8118047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8118487Z def test_silu_mul_quant( 2025-05-07T20:32:50.8118734Z self, 2025-05-07T20:32:50.8118942Z T: int, 2025-05-07T20:32:50.8119148Z D: int, 2025-05-07T20:32:50.8119380Z scale_ub: Optional[float], 2025-05-07T20:32:50.8119680Z contiguous: bool, 2025-05-07T20:32:50.8119932Z compiled: bool, 2025-05-07T20:32:50.8120168Z ) -> None: 2025-05-07T20:32:50.8120392Z torch.manual_seed(2025) 2025-05-07T20:32:50.8128533Z 2025-05-07T20:32:50.8128822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8129167Z 2025-05-07T20:32:50.8129365Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8129659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8129977Z x = x_sign * x_clamp 2025-05-07T20:32:50.8130218Z x0 = x[:, :D] 2025-05-07T20:32:50.8130419Z x1 = x[:, D:] 2025-05-07T20:32:50.8130629Z 2025-05-07T20:32:50.8130816Z if contiguous: 2025-05-07T20:32:50.8131049Z x0 = x0.contiguous() 2025-05-07T20:32:50.8131311Z x1 = x1.contiguous() 2025-05-07T20:32:50.8131559Z 2025-05-07T20:32:50.8131754Z if scale_ub is not None: 2025-05-07T20:32:50.8132036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8132382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8132694Z ) 2025-05-07T20:32:50.8132890Z else: 2025-05-07T20:32:50.8133171Z scale_ub_tensor = None 2025-05-07T20:32:50.8133427Z 2025-05-07T20:32:50.8133925Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8134249Z op = silu_mul_quant 2025-05-07T20:32:50.8134504Z if compiled: 2025-05-07T20:32:50.8134831Z op = torch.compile(op) 2025-05-07T20:32:50.8135137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8135413Z 2025-05-07T20:32:50.8135604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8135783Z 2025-05-07T20:32:50.8135882Z moe/activation_test.py:117: 2025-05-07T20:32:50.8136181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8136596Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8136877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8137431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.8137994Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.8138635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8139319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8139927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8140599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8141247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8141819Z kernel = self.compile( 2025-05-07T20:32:50.8142364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8143001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8143397Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8143633Z 2025-05-07T20:32:50.8143838Z self = 2025-05-07T20:32:50.8144914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8146270Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4c180>} 2025-05-07T20:32:50.8147598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8148610Z context = 2025-05-07T20:32:50.8148894Z 2025-05-07T20:32:50.8149065Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8149585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8150043Z module_map=module_map) 2025-05-07T20:32:50.8150413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8150771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8151029Z E ^ 2025-05-07T20:32:50.8151497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8151938Z 2025-05-07T20:32:50.8152354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8152860Z 2025-05-07T20:32:50.8152973Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8153377Z self=, 2025-05-07T20:32:50.8153777Z T=2048, 2025-05-07T20:32:50.8153969Z D=7168, 2025-05-07T20:32:50.8154213Z scale_ub=1200.0, 2025-05-07T20:32:50.8154441Z contiguous=False, 2025-05-07T20:32:50.8154668Z compiled=True, 2025-05-07T20:32:50.8154870Z ) 2025-05-07T20:32:50.8155236Z self = 2025-05-07T20:32:50.8155732Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.8156000Z 2025-05-07T20:32:50.8156088Z @given( 2025-05-07T20:32:50.8156317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8156631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8157000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8157325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8157654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8157938Z ) 2025-05-07T20:32:50.8158278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8158720Z def test_silu_mul_quant( 2025-05-07T20:32:50.8158962Z self, 2025-05-07T20:32:50.8159158Z T: int, 2025-05-07T20:32:50.8159758Z D: int, 2025-05-07T20:32:50.8160051Z scale_ub: Optional[float], 2025-05-07T20:32:50.8160322Z contiguous: bool, 2025-05-07T20:32:50.8160568Z compiled: bool, 2025-05-07T20:32:50.8160797Z ) -> None: 2025-05-07T20:32:50.8161012Z torch.manual_seed(2025) 2025-05-07T20:32:50.8161252Z 2025-05-07T20:32:50.8161526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8161872Z 2025-05-07T20:32:50.8162069Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8162361Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8162676Z x = x_sign * x_clamp 2025-05-07T20:32:50.8162920Z x0 = x[:, :D] 2025-05-07T20:32:50.8163144Z x1 = x[:, D:] 2025-05-07T20:32:50.8163352Z 2025-05-07T20:32:50.8163543Z if contiguous: 2025-05-07T20:32:50.8163783Z x0 = x0.contiguous() 2025-05-07T20:32:50.8164051Z x1 = x1.contiguous() 2025-05-07T20:32:50.8164294Z 2025-05-07T20:32:50.8164500Z if scale_ub is not None: 2025-05-07T20:32:50.8164785Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8165115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8165431Z ) 2025-05-07T20:32:50.8165628Z else: 2025-05-07T20:32:50.8165836Z scale_ub_tensor = None 2025-05-07T20:32:50.8166093Z 2025-05-07T20:32:50.8166331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8166651Z op = silu_mul_quant 2025-05-07T20:32:50.8166897Z if compiled: 2025-05-07T20:32:50.8167150Z op = torch.compile(op) 2025-05-07T20:32:50.8167454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8167724Z 2025-05-07T20:32:50.8167924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8168090Z 2025-05-07T20:32:50.8168198Z moe/activation_test.py:117: 2025-05-07T20:32:50.8168494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8168838Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8169131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8169686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.8170250Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.8170907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8171614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8172183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8172862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8173607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8174204Z kernel = self.compile( 2025-05-07T20:32:50.8174794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8175443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8175842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8176068Z 2025-05-07T20:32:50.8176274Z self = 2025-05-07T20:32:50.8177411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8178767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4cea0>} 2025-05-07T20:32:50.8180156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8181170Z context = 2025-05-07T20:32:50.8181460Z 2025-05-07T20:32:50.8181628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8182191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8182658Z module_map=module_map) 2025-05-07T20:32:50.8183029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8183380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8183646Z E ^ 2025-05-07T20:32:50.8184109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8184556Z 2025-05-07T20:32:50.8184980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8185492Z 2025-05-07T20:32:50.9337691Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9338409Z self=, 2025-05-07T20:32:50.9338954Z T=1, 2025-05-07T20:32:50.9339212Z D=5120, 2025-05-07T20:32:50.9339413Z scale_ub=None, 2025-05-07T20:32:50.9339635Z contiguous=False, 2025-05-07T20:32:50.9339899Z compiled=False, 2025-05-07T20:32:50.9340113Z ) 2025-05-07T20:32:50.9340428Z self = 2025-05-07T20:32:50.9340920Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.9341179Z 2025-05-07T20:32:50.9341265Z @given( 2025-05-07T20:32:50.9341497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9341827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9342139Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9342477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9342803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9343101Z ) 2025-05-07T20:32:50.9343455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9343906Z def test_silu_mul_quant( 2025-05-07T20:32:50.9344152Z self, 2025-05-07T20:32:50.9344355Z T: int, 2025-05-07T20:32:50.9344561Z D: int, 2025-05-07T20:32:50.9344781Z scale_ub: Optional[float], 2025-05-07T20:32:50.9345052Z contiguous: bool, 2025-05-07T20:32:50.9345296Z compiled: bool, 2025-05-07T20:32:50.9345520Z ) -> None: 2025-05-07T20:32:50.9345746Z torch.manual_seed(2025) 2025-05-07T20:32:50.9345993Z 2025-05-07T20:32:50.9346520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9346866Z 2025-05-07T20:32:50.9347065Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9347444Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9347756Z x = x_sign * x_clamp 2025-05-07T20:32:50.9348001Z x0 = x[:, :D] 2025-05-07T20:32:50.9348218Z x1 = x[:, D:] 2025-05-07T20:32:50.9348432Z 2025-05-07T20:32:50.9348622Z if contiguous: 2025-05-07T20:32:50.9348853Z x0 = x0.contiguous() 2025-05-07T20:32:50.9349114Z x1 = x1.contiguous() 2025-05-07T20:32:50.9349467Z 2025-05-07T20:32:50.9349656Z if scale_ub is not None: 2025-05-07T20:32:50.9349929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9350267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9350578Z ) 2025-05-07T20:32:50.9350773Z else: 2025-05-07T20:32:50.9350991Z scale_ub_tensor = None 2025-05-07T20:32:50.9351246Z 2025-05-07T20:32:50.9351472Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9351788Z op = silu_mul_quant 2025-05-07T20:32:50.9352131Z if compiled: 2025-05-07T20:32:50.9352382Z op = torch.compile(op) 2025-05-07T20:32:50.9352688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9352964Z 2025-05-07T20:32:50.9353164Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9353338Z 2025-05-07T20:32:50.9353439Z moe/activation_test.py:117: 2025-05-07T20:32:50.9353741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9354075Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9354362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9355049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9355731Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9356268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9356950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9357609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9358137Z kernel = self.compile( 2025-05-07T20:32:50.9358680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9359629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9360030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9360256Z 2025-05-07T20:32:50.9360465Z self = 2025-05-07T20:32:50.9361532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9362900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4de40>} 2025-05-07T20:32:50.9364218Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9365230Z context = 2025-05-07T20:32:50.9365512Z 2025-05-07T20:32:50.9365678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9366193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9366733Z module_map=module_map) 2025-05-07T20:32:50.9367097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9367513Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9367783Z E ^ 2025-05-07T20:32:50.9368253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9368700Z 2025-05-07T20:32:50.9369113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9369624Z 2025-05-07T20:32:50.9369792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9370209Z self=, 2025-05-07T20:32:50.9370603Z T=4096, 2025-05-07T20:32:50.9370800Z D=7168, 2025-05-07T20:32:50.9370998Z scale_ub=1200.0, 2025-05-07T20:32:50.9371231Z contiguous=False, 2025-05-07T20:32:50.9371456Z compiled=False, 2025-05-07T20:32:50.9371669Z ) 2025-05-07T20:32:50.9371995Z self = 2025-05-07T20:32:50.9372548Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9372829Z 2025-05-07T20:32:50.9372910Z @given( 2025-05-07T20:32:50.9373235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9373547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9373859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9374190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9374517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9374805Z ) 2025-05-07T20:32:50.9375158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9375600Z def test_silu_mul_quant( 2025-05-07T20:32:50.9375841Z self, 2025-05-07T20:32:50.9376042Z T: int, 2025-05-07T20:32:50.9376248Z D: int, 2025-05-07T20:32:50.9376476Z scale_ub: Optional[float], 2025-05-07T20:32:50.9376759Z contiguous: bool, 2025-05-07T20:32:50.9377013Z compiled: bool, 2025-05-07T20:32:50.9377241Z ) -> None: 2025-05-07T20:32:50.9377469Z torch.manual_seed(2025) 2025-05-07T20:32:50.9377720Z 2025-05-07T20:32:50.9377990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9378343Z 2025-05-07T20:32:50.9378543Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9378836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9379154Z x = x_sign * x_clamp 2025-05-07T20:32:50.9379401Z x0 = x[:, :D] 2025-05-07T20:32:50.9379615Z x1 = x[:, D:] 2025-05-07T20:32:50.9379829Z 2025-05-07T20:32:50.9380019Z if contiguous: 2025-05-07T20:32:50.9380257Z x0 = x0.contiguous() 2025-05-07T20:32:50.9380522Z x1 = x1.contiguous() 2025-05-07T20:32:50.9380769Z 2025-05-07T20:32:50.9380972Z if scale_ub is not None: 2025-05-07T20:32:50.9381245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9381590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9381909Z ) 2025-05-07T20:32:50.9382107Z else: 2025-05-07T20:32:50.9382329Z scale_ub_tensor = None 2025-05-07T20:32:50.9382593Z 2025-05-07T20:32:50.9382826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9383152Z op = silu_mul_quant 2025-05-07T20:32:50.9383412Z if compiled: 2025-05-07T20:32:50.9383666Z op = torch.compile(op) 2025-05-07T20:32:50.9383969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9384253Z 2025-05-07T20:32:50.9384442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9384612Z 2025-05-07T20:32:50.9384715Z moe/activation_test.py:117: 2025-05-07T20:32:50.9385013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9385399Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9385677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9386404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.9387093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9387625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9388312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9389013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9389553Z kernel = self.compile( 2025-05-07T20:32:50.9390092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9390750Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9391163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9391393Z 2025-05-07T20:32:50.9391677Z self = 2025-05-07T20:32:50.9392767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9394127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4f380>} 2025-05-07T20:32:50.9395456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9396478Z context = 2025-05-07T20:32:50.9396766Z 2025-05-07T20:32:50.9396937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9397465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9397946Z module_map=module_map) 2025-05-07T20:32:50.9398323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9398681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9398953Z E ^ 2025-05-07T20:32:50.9399428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9399874Z 2025-05-07T20:32:50.9400293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9400807Z 2025-05-07T20:32:50.9400918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9401341Z self=, 2025-05-07T20:32:50.9401747Z T=16384, 2025-05-07T20:32:50.9401943Z D=7168, 2025-05-07T20:32:50.9402147Z scale_ub=None, 2025-05-07T20:32:50.9402373Z contiguous=True, 2025-05-07T20:32:50.9402600Z compiled=True, 2025-05-07T20:32:50.9402815Z ) 2025-05-07T20:32:51.1149412Z self = 2025-05-07T20:32:51.1150082Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.1150360Z 2025-05-07T20:32:51.1150460Z @given( 2025-05-07T20:32:51.1150707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1151037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1151347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1151675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1152006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1152556Z ) 2025-05-07T20:32:51.1152903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1153442Z def test_silu_mul_quant( 2025-05-07T20:32:51.1153698Z self, 2025-05-07T20:32:51.1153887Z T: int, 2025-05-07T20:32:51.1154086Z D: int, 2025-05-07T20:32:51.1154311Z scale_ub: Optional[float], 2025-05-07T20:32:51.1154575Z contiguous: bool, 2025-05-07T20:32:51.1154816Z compiled: bool, 2025-05-07T20:32:51.1155051Z ) -> None: 2025-05-07T20:32:51.1155266Z torch.manual_seed(2025) 2025-05-07T20:32:51.1155597Z 2025-05-07T20:32:51.1155876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1156217Z 2025-05-07T20:32:51.1156414Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1156708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1157021Z x = x_sign * x_clamp 2025-05-07T20:32:51.1157264Z x0 = x[:, :D] 2025-05-07T20:32:51.1157485Z x1 = x[:, D:] 2025-05-07T20:32:51.1157698Z 2025-05-07T20:32:51.1157883Z if contiguous: 2025-05-07T20:32:51.1158196Z x0 = x0.contiguous() 2025-05-07T20:32:51.1158461Z x1 = x1.contiguous() 2025-05-07T20:32:51.1158694Z 2025-05-07T20:32:51.1158889Z if scale_ub is not None: 2025-05-07T20:32:51.1159162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1159768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1160079Z ) 2025-05-07T20:32:51.1160280Z else: 2025-05-07T20:32:51.1160491Z scale_ub_tensor = None 2025-05-07T20:32:51.1160749Z 2025-05-07T20:32:51.1160985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1161296Z op = silu_mul_quant 2025-05-07T20:32:51.1161554Z if compiled: 2025-05-07T20:32:51.1161816Z op = torch.compile(op) 2025-05-07T20:32:51.1162116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1162385Z 2025-05-07T20:32:51.1162582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1162749Z 2025-05-07T20:32:51.1162860Z moe/activation_test.py:117: 2025-05-07T20:32:51.1163152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1163479Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1163761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1164313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1164871Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1165534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.1166215Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.1166748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.1167428Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.1168094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.1168638Z     kernel = self.compile(
2025-05-07T20:32:51.1176440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.1177108Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.1177519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1177752Z 
2025-05-07T20:32:51.1177968Z self = 
2025-05-07T20:32:51.1179035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.1180583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f44a0>}
2025-05-07T20:32:51.1181908Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.1182920Z context = 
2025-05-07T20:32:51.1183272Z 
2025-05-07T20:32:51.1183447Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.1183957Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.1184423Z                           module_map=module_map)
2025-05-07T20:32:51.1184790Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.1185141Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.1185407Z E   ^
2025-05-07T20:32:51.1185928Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1186377Z 
2025-05-07T20:32:51.1186799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1187303Z 
2025-05-07T20:32:51.1187409Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.1187820Z     self=,
2025-05-07T20:32:51.1188230Z     T=4096,
2025-05-07T20:32:51.1188414Z     D=5120,
2025-05-07T20:32:51.1188615Z     scale_ub=None,
2025-05-07T20:32:51.1188837Z     contiguous=False,
2025-05-07T20:32:51.1189061Z     compiled=True,
2025-05-07T20:32:51.1189273Z )
2025-05-07T20:32:51.1189591Z self = 
2025-05-07T20:32:51.1190085Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:51.1190352Z 
2025-05-07T20:32:51.1190431Z     @given(
2025-05-07T20:32:51.1190671Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.1190987Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.1191287Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.1191623Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.1191996Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.1192272Z     )
2025-05-07T20:32:51.1192628Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.1193065Z     def test_silu_mul_quant(
2025-05-07T20:32:51.1193307Z         self,
2025-05-07T20:32:51.1193498Z         T: int,
2025-05-07T20:32:51.1193699Z         D: int,
2025-05-07T20:32:51.1193917Z         scale_ub: Optional[float],
2025-05-07T20:32:51.1194182Z         contiguous: bool,
2025-05-07T20:32:51.1194434Z         compiled: bool,
2025-05-07T20:32:51.1194657Z     ) -> None:
2025-05-07T20:32:51.1194871Z         torch.manual_seed(2025)
2025-05-07T20:32:51.1195120Z 
2025-05-07T20:32:51.1195402Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.1195741Z 
2025-05-07T20:32:51.1195943Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.1196243Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.1196554Z         x = x_sign * x_clamp
2025-05-07T20:32:51.1196799Z         x0 = x[:, :D]
2025-05-07T20:32:51.1197022Z         x1 = x[:, D:]
2025-05-07T20:32:51.1197231Z 
2025-05-07T20:32:51.1197426Z         if contiguous:
2025-05-07T20:32:51.1197663Z             x0 = x0.contiguous()
2025-05-07T20:32:51.1197917Z             x1 = x1.contiguous()
2025-05-07T20:32:51.1198161Z 
2025-05-07T20:32:51.1198359Z         if scale_ub is not None:
2025-05-07T20:32:51.1198638Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:51.1199040Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:51.1199356Z             )
2025-05-07T20:32:51.1199556Z         else:
2025-05-07T20:32:51.1199810Z             scale_ub_tensor = None
2025-05-07T20:32:51.1200069Z 
2025-05-07T20:32:51.1200307Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:51.1200615Z             op = silu_mul_quant
2025-05-07T20:32:51.1200872Z             if compiled:
2025-05-07T20:32:51.1201123Z                 op = torch.compile(op)
2025-05-07T20:32:51.1201412Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.1201732Z 
2025-05-07T20:32:51.1201934Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:51.1202123Z 
2025-05-07T20:32:51.1202246Z moe/activation_test.py:117: 
2025-05-07T20:32:51.1202548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1202880Z moe/activation_test.py:115: in fn
2025-05-07T20:32:51.1203163Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.1203714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:51.1204273Z     return fn(*args, **kwargs)
2025-05-07T20:32:51.1204973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.1205646Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.1206177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.1206849Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.1207509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.1208027Z     kernel = self.compile(
2025-05-07T20:32:51.1208564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.1209212Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.1209610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1209835Z 
2025-05-07T20:32:51.1210042Z self = 
2025-05-07T20:32:51.1211108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.1212517Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f51c0>}
2025-05-07T20:32:51.1213917Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.1214923Z context = 
2025-05-07T20:32:51.1215217Z 
2025-05-07T20:32:51.1215388Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.1215907Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.1216389Z                           module_map=module_map)
2025-05-07T20:32:51.1216753Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.1217126Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.1217400Z E   ^
2025-05-07T20:32:51.1217857Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1218310Z 
2025-05-07T20:32:51.1218720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
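Every example in this sweep fails at the same point: while building IR for _fbgemm_silu_mul_quant, Triton rejects the fp8e4nv element type (the type behind torch.float8_e4m3fn), which requires compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports compute capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError says. Note that compiled=True vs. compiled=False cannot change the outcome, since the Triton kernel is compiled at launch either way. A minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv is hypothetical and not part of fbgemm_gpu:

    # Hypothetical guard, a sketch only: skip fp8e4nv tests on GPUs older
    # than SM 8.9, where Triton cannot lower that dtype (as seen above).
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
        # the A10G on this runner reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied to the test above, e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...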
2025-05-07T20:32:51.4291192Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): fails with the identical CompilationError in make_ir (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:51.4323265Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.5507288Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:51.5539053Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False): identical CompilationError
2025-05-07T20:32:51.5577552Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.8012432Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:51.8043575Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.8951548Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True): identical CompilationError
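The error is raised at kernel-compile time, before any data is touched, which is why the sweep over T, D, scale_ub, and contiguity only re-triggers the same failure. A standalone repro sketch, assuming only triton, torch, and the same pre-SM-8.9 CUDA device (no fbgemm_gpu involved); the kernel name _cast_fp8e4nv is hypothetical:

    # Minimal repro sketch (assumes triton + a CUDA GPU older than SM 8.9).
    # Compiling this kernel raises the same CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        y = x.to(tl.float8e4nv)  # rejected at compile time on SM 8.0/8.6
        tl.store(y_ptr + offs, y, mask=mask)


    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, 128, BLOCK=128)  # CompilationError on A10G

The remaining Hypothesis examples below hit the same compile-time path.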
2025-05-07T20:32:52.0579951Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:52.0612071Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): identical CompilationError
2025-05-07T20:32:52.2355285Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same test body and traceback as the first example above, again failing in make_ir:
2025-05-07T20:32:52.2388545Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:52.2388899Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:52.2389151Z E   ^
2025-05-07T20:32:52.2389614Z E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2390061Z 2025-05-07T20:32:52.2390479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2390981Z 2025-05-07T20:32:52.2391092Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2391505Z self=, 2025-05-07T20:32:52.2391919Z T=2048, 2025-05-07T20:32:52.2392153Z D=5120, 2025-05-07T20:32:52.2392363Z scale_ub=None, 2025-05-07T20:32:52.2392593Z contiguous=False, 2025-05-07T20:32:52.2392822Z compiled=True, 2025-05-07T20:32:52.2393021Z ) 2025-05-07T20:32:52.3296715Z self = 2025-05-07T20:32:52.3297241Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.3297616Z 2025-05-07T20:32:52.3297726Z @given( 2025-05-07T20:32:52.3298058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3298369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3298676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3299006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299881Z ) 2025-05-07T20:32:52.3300239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3300757Z def test_silu_mul_quant( 2025-05-07T20:32:52.3301016Z self, 2025-05-07T20:32:52.3301220Z T: int, 2025-05-07T20:32:52.3301417Z D: int, 2025-05-07T20:32:52.3301644Z scale_ub: Optional[float], 2025-05-07T20:32:52.3301921Z contiguous: bool, 2025-05-07T20:32:52.3302168Z compiled: bool, 2025-05-07T20:32:52.3302402Z ) -> None: 2025-05-07T20:32:52.3302632Z torch.manual_seed(2025) 2025-05-07T20:32:52.3302958Z 2025-05-07T20:32:52.3303227Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3303576Z 2025-05-07T20:32:52.3303777Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3304068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3304390Z x = x_sign * x_clamp 2025-05-07T20:32:52.3304644Z x0 = x[:, :D] 2025-05-07T20:32:52.3304869Z x1 = x[:, D:] 2025-05-07T20:32:52.3305084Z 2025-05-07T20:32:52.3305271Z if contiguous: 2025-05-07T20:32:52.3305507Z x0 = x0.contiguous() 2025-05-07T20:32:52.3305845Z x1 = x1.contiguous() 2025-05-07T20:32:52.3306094Z 2025-05-07T20:32:52.3306283Z if scale_ub is not None: 2025-05-07T20:32:52.3306564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3306900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3307215Z ) 2025-05-07T20:32:52.3307409Z else: 2025-05-07T20:32:52.3307631Z scale_ub_tensor = None 2025-05-07T20:32:52.3307891Z 2025-05-07T20:32:52.3308124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3308438Z op = silu_mul_quant 2025-05-07T20:32:52.3308698Z if compiled: 2025-05-07T20:32:52.3308945Z op = torch.compile(op) 2025-05-07T20:32:52.3309256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3309542Z 2025-05-07T20:32:52.3309729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3309904Z 2025-05-07T20:32:52.3310009Z moe/activation_test.py:117: 2025-05-07T20:32:52.3310317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3310642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3310933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3311494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3312058Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3312718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3313411Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3313944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3314644Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3315310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3315841Z kernel = self.compile( 2025-05-07T20:32:52.3316390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3317046Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3317453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3317697Z 2025-05-07T20:32:52.3317914Z self = 2025-05-07T20:32:52.3319000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3320454Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f34d9e0>} 2025-05-07T20:32:52.3321794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3322809Z context = 2025-05-07T20:32:52.3323094Z 2025-05-07T20:32:52.3323309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3323832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3324301Z module_map=module_map) 2025-05-07T20:32:52.3324676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3325042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3325307Z E ^ 2025-05-07T20:32:52.3325822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3326270Z 2025-05-07T20:32:52.3326693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3327200Z 2025-05-07T20:32:52.3327331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3327751Z self=, 2025-05-07T20:32:52.3328150Z T=2048, 2025-05-07T20:32:52.3328344Z D=5120, 2025-05-07T20:32:52.3328542Z scale_ub=1200.0, 2025-05-07T20:32:52.3328768Z contiguous=False, 2025-05-07T20:32:52.3328997Z compiled=True, 2025-05-07T20:32:52.3329209Z ) 2025-05-07T20:32:52.3329527Z self = 2025-05-07T20:32:52.3330028Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.3330307Z 2025-05-07T20:32:52.3330398Z @given( 2025-05-07T20:32:52.3330639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3330961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3331276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3331615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3331948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3332243Z ) 2025-05-07T20:32:52.3332601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3333131Z def test_silu_mul_quant( 2025-05-07T20:32:52.3333386Z self, 2025-05-07T20:32:52.3333590Z T: int, 2025-05-07T20:32:52.3333791Z D: int, 2025-05-07T20:32:52.3334025Z scale_ub: Optional[float], 2025-05-07T20:32:52.3334308Z contiguous: bool, 2025-05-07T20:32:52.3334552Z compiled: bool, 2025-05-07T20:32:52.3334794Z ) -> None: 2025-05-07T20:32:52.3335017Z torch.manual_seed(2025) 2025-05-07T20:32:52.3335264Z 2025-05-07T20:32:52.3343398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3343796Z 2025-05-07T20:32:52.3343994Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3344298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3344603Z x = x_sign * x_clamp 2025-05-07T20:32:52.3344844Z x0 = x[:, :D] 2025-05-07T20:32:52.3345059Z x1 = x[:, D:] 2025-05-07T20:32:52.3345275Z 2025-05-07T20:32:52.3345466Z if contiguous: 2025-05-07T20:32:52.3345695Z x0 = x0.contiguous() 2025-05-07T20:32:52.3345956Z x1 = x1.contiguous() 2025-05-07T20:32:52.3346203Z 2025-05-07T20:32:52.3346391Z if scale_ub is not None: 2025-05-07T20:32:52.3346676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3347012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3347400Z ) 2025-05-07T20:32:52.3347589Z else: 2025-05-07T20:32:52.3347846Z scale_ub_tensor = None 2025-05-07T20:32:52.3348103Z 2025-05-07T20:32:52.3348345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3348667Z op = silu_mul_quant 2025-05-07T20:32:52.3348918Z if compiled: 2025-05-07T20:32:52.3349177Z op = torch.compile(op) 2025-05-07T20:32:52.3349481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3349751Z 2025-05-07T20:32:52.3350000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3350174Z 2025-05-07T20:32:52.3350276Z moe/activation_test.py:117: 2025-05-07T20:32:52.3350581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3350915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3351197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3351764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3352325Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3353030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3353722Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3354268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3354943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3355622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3356156Z kernel = self.compile( 2025-05-07T20:32:52.3356695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3357348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3357754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3357987Z 2025-05-07T20:32:52.3358206Z self = 2025-05-07T20:32:52.3359563Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3360935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f34eb60>} 2025-05-07T20:32:52.3362268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3363294Z context = 2025-05-07T20:32:52.3363582Z 2025-05-07T20:32:52.3363765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3364279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3364753Z module_map=module_map) 2025-05-07T20:32:52.3365130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3365479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3365746Z E ^ 2025-05-07T20:32:52.3366216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3366664Z 2025-05-07T20:32:52.3367098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3367608Z 2025-05-07T20:32:52.5109977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5110764Z self=, 2025-05-07T20:32:52.5111437Z T=4096, 2025-05-07T20:32:52.5111653Z D=5120, 2025-05-07T20:32:52.5111911Z scale_ub=1200.0, 2025-05-07T20:32:52.5112376Z contiguous=True, 2025-05-07T20:32:52.5112822Z compiled=True, 2025-05-07T20:32:52.5113235Z ) 2025-05-07T20:32:52.5113868Z self = 2025-05-07T20:32:52.5114858Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.5115537Z 2025-05-07T20:32:52.5115706Z @given( 2025-05-07T20:32:52.5116163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5116792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5117404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5118047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5118706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5119280Z ) 2025-05-07T20:32:52.5119988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5120986Z def test_silu_mul_quant( 2025-05-07T20:32:52.5121484Z self, 2025-05-07T20:32:52.5121881Z T: int, 2025-05-07T20:32:52.5122196Z D: int, 2025-05-07T20:32:52.5122456Z scale_ub: Optional[float], 2025-05-07T20:32:52.5122755Z contiguous: bool, 2025-05-07T20:32:52.5122997Z compiled: bool, 2025-05-07T20:32:52.5123231Z ) -> None: 2025-05-07T20:32:52.5123458Z torch.manual_seed(2025) 2025-05-07T20:32:52.5123699Z 2025-05-07T20:32:52.5123977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5124327Z 2025-05-07T20:32:52.5124520Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5124820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5125136Z x = x_sign * x_clamp 2025-05-07T20:32:52.5125389Z x0 = x[:, :D] 2025-05-07T20:32:52.5125613Z x1 = x[:, D:] 2025-05-07T20:32:52.5125831Z 2025-05-07T20:32:52.5126031Z if contiguous: 2025-05-07T20:32:52.5126264Z x0 = x0.contiguous() 2025-05-07T20:32:52.5126536Z x1 = x1.contiguous() 2025-05-07T20:32:52.5126783Z 2025-05-07T20:32:52.5126977Z if scale_ub is not None: 2025-05-07T20:32:52.5127259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5127604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5127915Z ) 2025-05-07T20:32:52.5128121Z else: 2025-05-07T20:32:52.5128343Z scale_ub_tensor = None 2025-05-07T20:32:52.5128598Z 2025-05-07T20:32:52.5128844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5129166Z op = silu_mul_quant 2025-05-07T20:32:52.5129416Z if compiled: 2025-05-07T20:32:52.5129676Z op = torch.compile(op) 2025-05-07T20:32:52.5129986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5130266Z 2025-05-07T20:32:52.5130469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5130646Z 2025-05-07T20:32:52.5130748Z moe/activation_test.py:117: 2025-05-07T20:32:52.5131054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5131387Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5131677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5132240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5132799Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.5133575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5134268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5134812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5135588Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5136265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5136804Z kernel = self.compile( 2025-05-07T20:32:52.5137350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5138010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5138457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5138685Z 2025-05-07T20:32:52.5138902Z self = 2025-05-07T20:32:52.5139981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5141410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ec180>} 2025-05-07T20:32:52.5142810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5143834Z context = 2025-05-07T20:32:52.5144127Z 2025-05-07T20:32:52.5144301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5144821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5145294Z module_map=module_map) 2025-05-07T20:32:52.5145675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5146033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.5146303Z E ^ 2025-05-07T20:32:52.5146782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5147237Z 2025-05-07T20:32:52.5147667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5148179Z 2025-05-07T20:32:52.5148300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5148727Z self=, 2025-05-07T20:32:52.5149142Z T=128, 2025-05-07T20:32:52.5149334Z D=5120, 2025-05-07T20:32:52.5149541Z scale_ub=1200.0, 2025-05-07T20:32:52.5149777Z contiguous=False, 2025-05-07T20:32:52.5150013Z compiled=True, 2025-05-07T20:32:52.5150217Z ) 2025-05-07T20:32:52.7821649Z self = 2025-05-07T20:32:52.7822308Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.7822718Z 2025-05-07T20:32:52.7822850Z @given( 2025-05-07T20:32:52.7823163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7823575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7823888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7824222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7824555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7824857Z ) 2025-05-07T20:32:52.7825212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7825657Z def test_silu_mul_quant( 2025-05-07T20:32:52.7825909Z self, 2025-05-07T20:32:52.7826116Z T: int, 2025-05-07T20:32:52.7826314Z D: int, 2025-05-07T20:32:52.7826543Z scale_ub: Optional[float], 2025-05-07T20:32:52.7827119Z contiguous: bool, 2025-05-07T20:32:52.7827357Z compiled: bool, 2025-05-07T20:32:52.7827599Z ) -> None: 2025-05-07T20:32:52.7827912Z torch.manual_seed(2025) 2025-05-07T20:32:52.7828163Z 2025-05-07T20:32:52.7828442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7828793Z 2025-05-07T20:32:52.7828990Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7829291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7829610Z x = x_sign * x_clamp 2025-05-07T20:32:52.7829848Z x0 = x[:, :D] 2025-05-07T20:32:52.7830213Z x1 = x[:, D:] 2025-05-07T20:32:52.7830429Z 2025-05-07T20:32:52.7830621Z if contiguous: 2025-05-07T20:32:52.7830856Z x0 = x0.contiguous() 2025-05-07T20:32:52.7831120Z x1 = x1.contiguous() 2025-05-07T20:32:52.7831367Z 2025-05-07T20:32:52.7831564Z if scale_ub is not None: 2025-05-07T20:32:52.7831840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7832175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7832493Z ) 2025-05-07T20:32:52.7832723Z else: 2025-05-07T20:32:52.7833034Z scale_ub_tensor = None 2025-05-07T20:32:52.7833293Z 2025-05-07T20:32:52.7833523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7833839Z op = silu_mul_quant 2025-05-07T20:32:52.7834093Z if compiled: 2025-05-07T20:32:52.7834337Z op = torch.compile(op) 2025-05-07T20:32:52.7834640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7834926Z 2025-05-07T20:32:52.7835143Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.7835306Z 2025-05-07T20:32:52.7835407Z moe/activation_test.py:117: 2025-05-07T20:32:52.7835704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7836033Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.7836311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7836873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.7837436Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.7838094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.7838770Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.7839303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7839982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7840643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7841164Z kernel = self.compile( 2025-05-07T20:32:52.7841708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7842364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7842758Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7842989Z 2025-05-07T20:32:52.7843197Z self = 2025-05-07T20:32:52.7844261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7845628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ecea0>} 2025-05-07T20:32:52.7846947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7848011Z context = 2025-05-07T20:32:52.7848346Z 2025-05-07T20:32:52.7848516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7849029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7849492Z module_map=module_map) 2025-05-07T20:32:52.7849850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7850248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.7850511Z E ^ 2025-05-07T20:32:52.7850965Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7851420Z 2025-05-07T20:32:52.7851830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7852363Z 2025-05-07T20:32:52.7852483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.7853075Z self=, 2025-05-07T20:32:52.7853476Z T=16384, 2025-05-07T20:32:52.7853675Z D=7168, 2025-05-07T20:32:52.7853874Z scale_ub=1200.0, 2025-05-07T20:32:52.7854095Z contiguous=True, 2025-05-07T20:32:52.7854317Z compiled=True, 2025-05-07T20:32:52.7854524Z ) 2025-05-07T20:32:52.7854839Z self = 2025-05-07T20:32:52.7855331Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.7855606Z 2025-05-07T20:32:52.7855698Z @given( 2025-05-07T20:32:52.7855925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7856240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7856548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7856880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7857205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7857494Z ) 2025-05-07T20:32:52.7857851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7858283Z def test_silu_mul_quant( 2025-05-07T20:32:52.7858528Z self, 2025-05-07T20:32:52.7858731Z T: int, 2025-05-07T20:32:52.7858926Z D: int, 2025-05-07T20:32:52.7859152Z scale_ub: Optional[float], 2025-05-07T20:32:52.7859727Z contiguous: bool, 2025-05-07T20:32:52.7859967Z compiled: bool, 2025-05-07T20:32:52.7860189Z ) -> None: 2025-05-07T20:32:52.7860407Z torch.manual_seed(2025) 2025-05-07T20:32:52.7860649Z 2025-05-07T20:32:52.7860916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7861261Z 2025-05-07T20:32:52.7861447Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7861742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7862064Z x = x_sign * x_clamp 2025-05-07T20:32:52.7862307Z x0 = x[:, :D] 2025-05-07T20:32:52.7862531Z x1 = x[:, D:] 2025-05-07T20:32:52.7862793Z 2025-05-07T20:32:52.7862992Z if contiguous: 2025-05-07T20:32:52.7863224Z x0 = x0.contiguous() 2025-05-07T20:32:52.7863487Z x1 = x1.contiguous() 2025-05-07T20:32:52.7863734Z 2025-05-07T20:32:52.7863929Z if scale_ub is not None: 2025-05-07T20:32:52.7864212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7864549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7864856Z ) 2025-05-07T20:32:52.7865057Z else: 2025-05-07T20:32:52.7865274Z scale_ub_tensor = None 2025-05-07T20:32:52.7865529Z 2025-05-07T20:32:52.7865764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7866082Z op = silu_mul_quant 2025-05-07T20:32:52.7866333Z if compiled: 2025-05-07T20:32:52.7866660Z op = torch.compile(op) 2025-05-07T20:32:52.7866956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7867299Z 2025-05-07T20:32:52.7867497Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.7867670Z 2025-05-07T20:32:52.7867772Z moe/activation_test.py:117: 2025-05-07T20:32:52.7868076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7868404Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.7868692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7869315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.7869864Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.7870518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.7871206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.7871742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7872487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7873152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7873683Z kernel = self.compile( 2025-05-07T20:32:52.7874225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7874867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7875267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7875493Z 2025-05-07T20:32:52.7875705Z self = 2025-05-07T20:32:52.7876774Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7878128Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ee0c0>} 2025-05-07T20:32:52.7879446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7880460Z context = 2025-05-07T20:32:52.7880744Z 2025-05-07T20:32:52.7880914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7881425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7881893Z module_map=module_map) 2025-05-07T20:32:52.7882263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7882623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.7882887Z E ^ 2025-05-07T20:32:52.7883357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7883801Z 2025-05-07T20:32:52.7884217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7884717Z 2025-05-07T20:32:52.9117377Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9117938Z self=, 2025-05-07T20:32:52.9118487Z T=16384, 2025-05-07T20:32:52.9118713Z D=5120, 2025-05-07T20:32:52.9118912Z scale_ub=1200.0, 2025-05-07T20:32:52.9119130Z contiguous=True, 2025-05-07T20:32:52.9119364Z compiled=False, 2025-05-07T20:32:52.9119574Z ) 2025-05-07T20:32:52.9120157Z self = 2025-05-07T20:32:52.9120742Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.9121028Z 2025-05-07T20:32:52.9121110Z @given( 2025-05-07T20:32:52.9121344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9121657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9121966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9122298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9122724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9123010Z ) 2025-05-07T20:32:52.9123363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9123813Z def test_silu_mul_quant( 2025-05-07T20:32:52.9124056Z self, 2025-05-07T20:32:52.9124263Z T: int, 2025-05-07T20:32:52.9124473Z D: int, 2025-05-07T20:32:52.9124697Z scale_ub: Optional[float], 2025-05-07T20:32:52.9124970Z contiguous: bool, 2025-05-07T20:32:52.9125216Z compiled: bool, 2025-05-07T20:32:52.9125513Z ) -> None: 2025-05-07T20:32:52.9125735Z torch.manual_seed(2025) 2025-05-07T20:32:52.9125989Z 2025-05-07T20:32:52.9126256Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9126598Z 2025-05-07T20:32:52.9126790Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9127073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9127378Z x = x_sign * x_clamp 2025-05-07T20:32:52.9127623Z x0 = x[:, :D] 2025-05-07T20:32:52.9127833Z x1 = x[:, D:] 2025-05-07T20:32:52.9128048Z 2025-05-07T20:32:52.9128244Z if contiguous: 2025-05-07T20:32:52.9128473Z x0 = x0.contiguous() 2025-05-07T20:32:52.9128734Z x1 = x1.contiguous() 2025-05-07T20:32:52.9128980Z 2025-05-07T20:32:52.9129180Z if scale_ub is not None: 2025-05-07T20:32:52.9129456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9129802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9130116Z ) 2025-05-07T20:32:52.9130310Z else: 2025-05-07T20:32:52.9130527Z scale_ub_tensor = None 2025-05-07T20:32:52.9130777Z 2025-05-07T20:32:52.9131003Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9131317Z op = silu_mul_quant 2025-05-07T20:32:52.9131568Z if compiled: 2025-05-07T20:32:52.9131810Z op = torch.compile(op) 2025-05-07T20:32:52.9132111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9132392Z 2025-05-07T20:32:52.9132585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9132758Z 2025-05-07T20:32:52.9132860Z moe/activation_test.py:117: 2025-05-07T20:32:52.9133280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9133615Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9133901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9134589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.9135275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9135799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9136472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9144695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9145239Z kernel = self.compile( 2025-05-07T20:32:52.9145773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9146411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9146906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9147133Z 2025-05-07T20:32:52.9147386Z self = 2025-05-07T20:32:52.9148457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9149820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4eda80>} 2025-05-07T20:32:52.9151191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9152201Z context = 2025-05-07T20:32:52.9152526Z 2025-05-07T20:32:52.9152706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9153270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9153736Z module_map=module_map) 2025-05-07T20:32:52.9154106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9154454Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9154723Z E ^ 2025-05-07T20:32:52.9155187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9155635Z 2025-05-07T20:32:52.9156048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9156558Z 2025-05-07T20:32:52.9156662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9157079Z self=, 2025-05-07T20:32:52.9157478Z T=1, 2025-05-07T20:32:52.9157659Z D=7168, 2025-05-07T20:32:52.9157864Z scale_ub=1200.0, 2025-05-07T20:32:52.9158090Z contiguous=False, 2025-05-07T20:32:52.9158311Z compiled=False, 2025-05-07T20:32:52.9158522Z ) 2025-05-07T20:32:52.9158846Z self = 2025-05-07T20:32:52.9159612Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.9159886Z 2025-05-07T20:32:52.9159967Z @given( 2025-05-07T20:32:52.9160200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9160504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9160812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9161138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9161464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9161744Z ) 2025-05-07T20:32:52.9162089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9162527Z def test_silu_mul_quant( 2025-05-07T20:32:52.9162793Z self, 2025-05-07T20:32:52.9163009Z T: int, 2025-05-07T20:32:52.9163207Z D: int, 2025-05-07T20:32:52.9163426Z scale_ub: Optional[float], 2025-05-07T20:32:52.9163700Z contiguous: bool, 2025-05-07T20:32:52.9163947Z compiled: bool, 2025-05-07T20:32:52.9164175Z ) -> None: 2025-05-07T20:32:52.9164401Z torch.manual_seed(2025) 2025-05-07T20:32:52.9164647Z 2025-05-07T20:32:52.9164914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9165263Z 2025-05-07T20:32:52.9165463Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9165762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9166064Z x = x_sign * x_clamp 2025-05-07T20:32:52.9166314Z x0 = x[:, :D] 2025-05-07T20:32:52.9166627Z x1 = x[:, D:] 2025-05-07T20:32:52.9166835Z 2025-05-07T20:32:52.9167035Z if contiguous: 2025-05-07T20:32:52.9167342Z x0 = x0.contiguous() 2025-05-07T20:32:52.9167599Z x1 = x1.contiguous() 2025-05-07T20:32:52.9167852Z 2025-05-07T20:32:52.9168046Z if scale_ub is not None: 2025-05-07T20:32:52.9168319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9168642Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9168958Z ) 2025-05-07T20:32:52.9169149Z else: 2025-05-07T20:32:52.9169439Z scale_ub_tensor = None 2025-05-07T20:32:52.9169699Z 2025-05-07T20:32:52.9169928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9170245Z op = silu_mul_quant 2025-05-07T20:32:52.9170496Z if compiled: 2025-05-07T20:32:52.9170750Z op = torch.compile(op) 2025-05-07T20:32:52.9171041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9171315Z 2025-05-07T20:32:52.9171511Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9171673Z 2025-05-07T20:32:52.9171843Z moe/activation_test.py:117: 2025-05-07T20:32:52.9172141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9172474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9172750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9173498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.9174185Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9174716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9175385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9176053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9176591Z kernel = self.compile( 2025-05-07T20:32:52.9177128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9177772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9178168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9178391Z 2025-05-07T20:32:52.9178607Z self = 2025-05-07T20:32:52.9179668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9181022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef680e0>} 2025-05-07T20:32:52.9182355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9183361Z context = 2025-05-07T20:32:52.9183644Z 2025-05-07T20:32:52.9183818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9184330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9184801Z module_map=module_map) 2025-05-07T20:32:52.9185169Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9185522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9185790Z E ^ 2025-05-07T20:32:52.9186252Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9186747Z 2025-05-07T20:32:52.9187207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9187709Z 2025-05-07T20:32:53.0917260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0918307Z self=, 2025-05-07T20:32:53.0919126Z T=4096, 2025-05-07T20:32:53.0919513Z D=7168, 2025-05-07T20:32:53.0919897Z scale_ub=1200.0, 2025-05-07T20:32:53.0920356Z contiguous=False, 2025-05-07T20:32:53.0921116Z compiled=True, 2025-05-07T20:32:53.0921530Z ) 2025-05-07T20:32:53.0922163Z self = 2025-05-07T20:32:53.0922783Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:53.0923059Z 2025-05-07T20:32:53.0923149Z @given( 2025-05-07T20:32:53.0923384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0923718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0924039Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0924456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0924793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0925097Z ) 2025-05-07T20:32:53.0925447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0925903Z def test_silu_mul_quant( 2025-05-07T20:32:53.0926157Z self, 2025-05-07T20:32:53.0926371Z T: int, 2025-05-07T20:32:53.0926575Z D: int, 2025-05-07T20:32:53.0926810Z scale_ub: Optional[float], 2025-05-07T20:32:53.0927097Z contiguous: bool, 2025-05-07T20:32:53.0927350Z compiled: bool, 2025-05-07T20:32:53.0927588Z ) -> None: 2025-05-07T20:32:53.0927816Z torch.manual_seed(2025) 2025-05-07T20:32:53.0928060Z 2025-05-07T20:32:53.0928344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0928697Z 2025-05-07T20:32:53.0928897Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0929206Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0929531Z x = x_sign * x_clamp 2025-05-07T20:32:53.0929775Z x0 = x[:, :D] 2025-05-07T20:32:53.0930004Z x1 = x[:, D:] 2025-05-07T20:32:53.0930226Z 2025-05-07T20:32:53.0930420Z if contiguous: 2025-05-07T20:32:53.0930663Z x0 = x0.contiguous() 2025-05-07T20:32:53.0930932Z x1 = x1.contiguous() 2025-05-07T20:32:53.0931188Z 2025-05-07T20:32:53.0931383Z if scale_ub is not None: 2025-05-07T20:32:53.0931669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0932016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0932328Z ) 2025-05-07T20:32:53.0932554Z else: 2025-05-07T20:32:53.0932800Z scale_ub_tensor = None 2025-05-07T20:32:53.0933151Z 2025-05-07T20:32:53.0933390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0933708Z op = silu_mul_quant 2025-05-07T20:32:53.0933960Z if compiled: 2025-05-07T20:32:53.0934218Z op = torch.compile(op) 2025-05-07T20:32:53.0934529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0934813Z 2025-05-07T20:32:53.0935006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0935179Z 2025-05-07T20:32:53.0935281Z moe/activation_test.py:117: 2025-05-07T20:32:53.0935585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0935925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0936210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0936772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0937338Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0938086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0938847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0939393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0940073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0940731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0941347Z kernel = self.compile( 2025-05-07T20:32:53.0941892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0942538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0942938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0943172Z 2025-05-07T20:32:53.0943380Z self = 2025-05-07T20:32:53.0944496Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0945865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef69300>} 2025-05-07T20:32:53.0947184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0948200Z context = 2025-05-07T20:32:53.0948492Z 2025-05-07T20:32:53.0948660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0949186Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0949652Z module_map=module_map) 2025-05-07T20:32:53.0950022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0950381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0950638Z E ^ 2025-05-07T20:32:53.0951106Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0951562Z 2025-05-07T20:32:53.0951988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0952522Z 2025-05-07T20:32:53.0952659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0953075Z self=, 2025-05-07T20:32:53.0953488Z T=128, 2025-05-07T20:32:53.0953690Z D=7168, 2025-05-07T20:32:53.0953883Z scale_ub=1200.0, 2025-05-07T20:32:53.0954127Z contiguous=False, 2025-05-07T20:32:53.0954364Z compiled=True, 2025-05-07T20:32:53.0954568Z ) 2025-05-07T20:32:53.1866088Z self = 2025-05-07T20:32:53.1866603Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:53.1866965Z 2025-05-07T20:32:53.1867091Z @given( 2025-05-07T20:32:53.1867406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1867834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1868274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1868703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1869136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1869507Z ) 2025-05-07T20:32:53.1869858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1870562Z def test_silu_mul_quant( 2025-05-07T20:32:53.1870808Z self, 2025-05-07T20:32:53.1871003Z T: int, 2025-05-07T20:32:53.1871300Z D: int, 2025-05-07T20:32:53.1871526Z scale_ub: Optional[float], 2025-05-07T20:32:53.1871807Z contiguous: bool, 2025-05-07T20:32:53.1872052Z compiled: bool, 2025-05-07T20:32:53.1872284Z ) -> None: 2025-05-07T20:32:53.1872504Z torch.manual_seed(2025) 2025-05-07T20:32:53.1872788Z 2025-05-07T20:32:53.1873056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1873488Z 2025-05-07T20:32:53.1873692Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1873989Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1874293Z x = x_sign * x_clamp 2025-05-07T20:32:53.1874537Z x0 = x[:, :D] 2025-05-07T20:32:53.1874765Z x1 = x[:, D:] 2025-05-07T20:32:53.1874971Z 2025-05-07T20:32:53.1875167Z if contiguous: 2025-05-07T20:32:53.1875407Z x0 = x0.contiguous() 2025-05-07T20:32:53.1875667Z x1 = x1.contiguous() 2025-05-07T20:32:53.1875918Z 2025-05-07T20:32:53.1876197Z if scale_ub is not None: 2025-05-07T20:32:53.1876476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1876826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1877144Z ) 2025-05-07T20:32:53.1877339Z else: 2025-05-07T20:32:53.1877560Z scale_ub_tensor = None 2025-05-07T20:32:53.1877824Z 2025-05-07T20:32:53.1878061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1878387Z op = silu_mul_quant 2025-05-07T20:32:53.1878652Z if compiled: 2025-05-07T20:32:53.1878916Z op = torch.compile(op) 2025-05-07T20:32:53.1879217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1879506Z 2025-05-07T20:32:53.1879712Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1879884Z 2025-05-07T20:32:53.1879986Z moe/activation_test.py:117: 2025-05-07T20:32:53.1880295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1880645Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1880927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1881490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.1882052Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.1882722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1883396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1883939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1884618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1885275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1885812Z kernel = self.compile( 2025-05-07T20:32:53.1886368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1887025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1887424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1887659Z 2025-05-07T20:32:53.1887871Z self = 2025-05-07T20:32:53.1888947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1890314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef6a020>} 2025-05-07T20:32:53.1891724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1892808Z context = 2025-05-07T20:32:53.1893203Z 2025-05-07T20:32:53.1893371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1893933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1894394Z module_map=module_map) 2025-05-07T20:32:53.1894763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1895118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1895382Z E ^ 2025-05-07T20:32:53.1895843Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1896294Z 2025-05-07T20:32:53.1896750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1897255Z 2025-05-07T20:32:53.1897366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1897779Z self=, 2025-05-07T20:32:53.1898179Z T=2048, 2025-05-07T20:32:53.1898373Z D=7168, 2025-05-07T20:32:53.1898571Z scale_ub=None, 2025-05-07T20:32:53.1898783Z contiguous=True, 2025-05-07T20:32:53.1899014Z compiled=True, 2025-05-07T20:32:53.1899225Z ) 2025-05-07T20:32:53.1899540Z self = 2025-05-07T20:32:53.1900033Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.1900300Z 2025-05-07T20:32:53.1900395Z @given( 2025-05-07T20:32:53.1900629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1900948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1901277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1901615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1901947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1902246Z ) 2025-05-07T20:32:53.1902617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1903090Z def test_silu_mul_quant( 2025-05-07T20:32:53.1903342Z self, 2025-05-07T20:32:53.1903552Z T: int, 2025-05-07T20:32:53.1903754Z D: int, 2025-05-07T20:32:53.1903992Z scale_ub: Optional[float], 2025-05-07T20:32:53.1904270Z contiguous: bool, 2025-05-07T20:32:53.1904514Z compiled: bool, 2025-05-07T20:32:53.1904751Z ) -> None: 2025-05-07T20:32:53.1904971Z torch.manual_seed(2025) 2025-05-07T20:32:53.1905221Z 2025-05-07T20:32:53.1905505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1905858Z 2025-05-07T20:32:53.1906062Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1906360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1906679Z x = x_sign * x_clamp 2025-05-07T20:32:53.1906932Z x0 = x[:, :D] 2025-05-07T20:32:53.1907155Z x1 = x[:, D:] 2025-05-07T20:32:53.1907376Z 2025-05-07T20:32:53.1907575Z if contiguous: 2025-05-07T20:32:53.1907805Z x0 = x0.contiguous() 2025-05-07T20:32:53.1908076Z x1 = x1.contiguous() 2025-05-07T20:32:53.1908322Z 2025-05-07T20:32:53.1908516Z if scale_ub is not None: 2025-05-07T20:32:53.1908798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1909142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1909452Z ) 2025-05-07T20:32:53.1909653Z else: 2025-05-07T20:32:53.1909931Z scale_ub_tensor = None 2025-05-07T20:32:53.1910179Z 2025-05-07T20:32:53.1910458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1910780Z op = silu_mul_quant 2025-05-07T20:32:53.1911030Z if compiled: 2025-05-07T20:32:53.1911288Z op = torch.compile(op) 2025-05-07T20:32:53.1911595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1911875Z 2025-05-07T20:32:53.1912071Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1912246Z 2025-05-07T20:32:53.1912358Z moe/activation_test.py:117: 2025-05-07T20:32:53.1912741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1913069Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1913357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1913917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.1914478Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.1915188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1915883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1916420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1917102Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1917775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1918311Z kernel = self.compile( 2025-05-07T20:32:53.1918863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1919525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1919934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1920168Z 2025-05-07T20:32:53.1920389Z self = 2025-05-07T20:32:53.1921455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1922861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef6b240>} 2025-05-07T20:32:53.1924212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1925254Z context = 2025-05-07T20:32:53.1925543Z 2025-05-07T20:32:53.1925730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1926263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1926754Z module_map=module_map) 2025-05-07T20:32:53.1927142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1927497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1927769Z E ^ 2025-05-07T20:32:53.1928249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1928710Z 2025-05-07T20:32:53.1929140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1929642Z 2025-05-07T20:32:53.2536239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2536890Z self=, 2025-05-07T20:32:53.2537665Z T=16384, 2025-05-07T20:32:53.2537922Z D=5120, 2025-05-07T20:32:53.2538112Z scale_ub=None, 2025-05-07T20:32:53.2538465Z contiguous=False, 2025-05-07T20:32:53.2538700Z compiled=False, 2025-05-07T20:32:53.2538907Z ) 2025-05-07T20:32:53.2539246Z self = 2025-05-07T20:32:53.2539747Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.2540032Z 2025-05-07T20:32:53.2540115Z @given( 2025-05-07T20:32:53.2540344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2540725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2541030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2541366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2541690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2541977Z ) 2025-05-07T20:32:53.2542346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2542824Z def test_silu_mul_quant( 2025-05-07T20:32:53.2543068Z self, 2025-05-07T20:32:53.2543337Z T: int, 2025-05-07T20:32:53.2550829Z D: int, 2025-05-07T20:32:53.2551114Z scale_ub: Optional[float], 2025-05-07T20:32:53.2551399Z contiguous: bool, 2025-05-07T20:32:53.2551646Z compiled: bool, 2025-05-07T20:32:53.2551870Z ) -> None: 2025-05-07T20:32:53.2552095Z torch.manual_seed(2025) 2025-05-07T20:32:53.2552352Z 2025-05-07T20:32:53.2552636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2552982Z 2025-05-07T20:32:53.2553187Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2553484Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2555499Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
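The CompilationError repeated above is an architecture limitation rather than a kernel bug: Triton's fp8e4nv type (FP8 E4M3) is only emitted for NVIDIA GPUs of compute capability 8.9 or newer (Ada, Hopper), while this g5.4xlarge runner's A10G is sm_86, where Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard a test like this could use to skip cleanly on pre-sm_89 GPUs (the helper and the skip decorator are illustrative assumptions, not FBGEMM's actual code):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (E4M3) codegen requires sm_89+ (Ada) or sm_90 (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...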
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2557347Z 2025-05-07T20:32:53.2557469Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2557694Z 2025-05-07T20:32:53.2557798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2558216Z self=, 2025-05-07T20:32:53.2558613Z T=4096, 2025-05-07T20:32:53.2558814Z D=7168, 2025-05-07T20:32:53.2559018Z scale_ub=1200.0, 2025-05-07T20:32:53.2559523Z contiguous=True, 2025-05-07T20:32:53.2559758Z compiled=True, 2025-05-07T20:32:53.2559967Z ) 2025-05-07T20:32:53.2560288Z self = 2025-05-07T20:32:53.2560790Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.2561062Z 2025-05-07T20:32:53.2561149Z @given( 2025-05-07T20:32:53.2561381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2561700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2562014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2562348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2562673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2562964Z ) 2025-05-07T20:32:53.2563316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2563757Z def test_silu_mul_quant( 2025-05-07T20:32:53.2564008Z self, 2025-05-07T20:32:53.2564216Z T: int, 2025-05-07T20:32:53.2564418Z D: int, 2025-05-07T20:32:53.2564769Z scale_ub: Optional[float], 2025-05-07T20:32:53.2565044Z contiguous: bool, 2025-05-07T20:32:53.2565282Z compiled: bool, 2025-05-07T20:32:53.2565591Z ) -> None: 2025-05-07T20:32:53.2565823Z torch.manual_seed(2025) 2025-05-07T20:32:53.2566068Z 2025-05-07T20:32:53.2566352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2566700Z 2025-05-07T20:32:53.2566907Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2567198Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2569206Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2571139Z 2025-05-07T20:32:53.2571325Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2571537Z 2025-05-07T20:32:53.2571650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2572063Z self=, 2025-05-07T20:32:53.2572516Z T=16384, 2025-05-07T20:32:53.2572734Z D=7168, 2025-05-07T20:32:53.2572935Z scale_ub=None, 2025-05-07T20:32:53.2573237Z contiguous=False, 2025-05-07T20:32:53.2573479Z compiled=False, 2025-05-07T20:32:53.2573690Z ) 2025-05-07T20:32:53.2574007Z self = 2025-05-07T20:32:53.2574508Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.2574788Z 2025-05-07T20:32:53.2574890Z @given( 2025-05-07T20:32:53.2575124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2575450Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2575774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2576108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2576455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2576760Z ) 2025-05-07T20:32:53.2577125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2577568Z def test_silu_mul_quant( 2025-05-07T20:32:53.2577825Z self, 2025-05-07T20:32:53.2578040Z T: int, 2025-05-07T20:32:53.2578243Z D: int, 2025-05-07T20:32:53.2578475Z scale_ub: Optional[float], 2025-05-07T20:32:53.2578761Z contiguous: bool, 2025-05-07T20:32:53.2579007Z compiled: bool, 2025-05-07T20:32:53.2579246Z ) -> None: 2025-05-07T20:32:53.2579469Z torch.manual_seed(2025) 2025-05-07T20:32:53.2579710Z 2025-05-07T20:32:53.2579995Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2582037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2583894Z 2025-05-07T20:32:53.2584015Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.2584224Z 2025-05-07T20:32:53.2584338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2584743Z self=, 2025-05-07T20:32:53.2585202Z T=2048, 2025-05-07T20:32:53.2585402Z D=7168, 2025-05-07T20:32:53.2585590Z scale_ub=1200.0, 2025-05-07T20:32:53.2585824Z contiguous=True, 2025-05-07T20:32:53.2586098Z compiled=True, 2025-05-07T20:32:53.2586302Z ) 2025-05-07T20:32:53.2586629Z self = 2025-05-07T20:32:53.2587121Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.2587393Z 2025-05-07T20:32:53.2587480Z @given( 2025-05-07T20:32:53.2587710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2588068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2588381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2588709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2589048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2589343Z ) 2025-05-07T20:32:53.2589686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2590138Z def test_silu_mul_quant( 2025-05-07T20:32:53.2590385Z self, 2025-05-07T20:32:53.2590582Z T: int, 2025-05-07T20:32:53.2590822Z D: int, 2025-05-07T20:32:53.2591044Z scale_ub: Optional[float], 2025-05-07T20:32:53.2591315Z contiguous: bool, 2025-05-07T20:32:53.2591562Z compiled: bool, 2025-05-07T20:32:53.2591794Z ) -> None: 2025-05-07T20:32:53.2592015Z torch.manual_seed(2025) 2025-05-07T20:32:53.2592257Z 2025-05-07T20:32:53.2592534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2592872Z 2025-05-07T20:32:53.2593067Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2593364Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2595351Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2597184Z 2025-05-07T20:32:53.2597308Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2597519Z 2025-05-07T20:32:53.2597623Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2598041Z self=, 2025-05-07T20:32:53.2598442Z T=2048, 2025-05-07T20:32:53.2598634Z D=7168, 2025-05-07T20:32:53.2598820Z scale_ub=None, 2025-05-07T20:32:53.2599042Z contiguous=True, 2025-05-07T20:32:53.2599277Z compiled=False, 2025-05-07T20:32:53.2599475Z ) 2025-05-07T20:32:53.3733234Z self = 2025-05-07T20:32:53.3733989Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.3734370Z 2025-05-07T20:32:53.3734504Z @given( 2025-05-07T20:32:53.3734734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.3735052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.3735361Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.3735706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.3736033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.3736328Z ) 2025-05-07T20:32:53.3736674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.3737109Z def test_silu_mul_quant( 2025-05-07T20:32:53.3737356Z self, 2025-05-07T20:32:53.3737559Z T: int, 2025-05-07T20:32:53.3737756Z D: int, 2025-05-07T20:32:53.3737979Z scale_ub: Optional[float], 2025-05-07T20:32:53.3738499Z contiguous: bool, 2025-05-07T20:32:53.3738736Z compiled: bool, 2025-05-07T20:32:53.3738966Z ) -> None: 2025-05-07T20:32:53.3739284Z torch.manual_seed(2025) 2025-05-07T20:32:53.3739529Z 2025-05-07T20:32:53.3739814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.3740160Z 2025-05-07T20:32:53.3740355Z > x_sign = torch.sign(x) 2025-05-07T20:32:53.3742280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
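The OutOfMemoryError examples interleaved here are a knock-on effect of the compilation failures: each failed example leaves its bfloat16 input behind (a [16384, 14336] bfloat16 tensor is exactly the 448.00 MiB the allocator reports), so the 22.07 GiB A10G fills up and later allocations of as little as 40 MiB fail. The error text suggests the allocator knob itself; a sketch of applying it, with the caveat that it only reduces fragmentation and does not reclaim memory held by live tensors:

    import os

    # Must be set before the process makes its first CUDA allocation,
    # e.g. in the CI job's environment, not inside the test body.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"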
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.3744199Z 2025-05-07T20:32:53.3744317Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.3744535Z 2025-05-07T20:32:53.3744709Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.3745129Z self=, 2025-05-07T20:32:53.3745522Z T=1, 2025-05-07T20:32:53.3745711Z D=7168, 2025-05-07T20:32:53.3745911Z scale_ub=1200.0, 2025-05-07T20:32:53.3746132Z contiguous=True, 2025-05-07T20:32:53.3746359Z compiled=False, 2025-05-07T20:32:53.3746572Z ) 2025-05-07T20:32:53.3746896Z self = 2025-05-07T20:32:53.3747375Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.3747648Z 2025-05-07T20:32:53.3747730Z @given( 2025-05-07T20:32:53.3747967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.3748278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.3748581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.3748908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.3749245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.3749523Z ) 2025-05-07T20:32:53.3749868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.3750305Z def test_silu_mul_quant( 2025-05-07T20:32:53.3750539Z self, 2025-05-07T20:32:53.3750737Z T: int, 2025-05-07T20:32:53.3750944Z D: int, 2025-05-07T20:32:53.3751164Z scale_ub: Optional[float], 2025-05-07T20:32:53.3751436Z contiguous: bool, 2025-05-07T20:32:53.3751681Z compiled: bool, 2025-05-07T20:32:53.3751907Z ) -> None: 2025-05-07T20:32:53.3752126Z torch.manual_seed(2025) 2025-05-07T20:32:53.3752373Z 2025-05-07T20:32:53.3752646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.3752980Z 2025-05-07T20:32:53.3753178Z x_sign = torch.sign(x) 2025-05-07T20:32:53.3753469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.3753776Z x = x_sign * x_clamp 2025-05-07T20:32:53.3754024Z x0 = x[:, :D] 2025-05-07T20:32:53.3754240Z x1 = x[:, D:] 2025-05-07T20:32:53.3754445Z 2025-05-07T20:32:53.3754633Z if contiguous: 2025-05-07T20:32:53.3754866Z x0 = x0.contiguous() 2025-05-07T20:32:53.3755115Z x1 = x1.contiguous() 2025-05-07T20:32:53.3755354Z 2025-05-07T20:32:53.3755550Z if scale_ub is not None: 2025-05-07T20:32:53.3755822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.3756153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.3756476Z ) 2025-05-07T20:32:53.3756666Z else: 2025-05-07T20:32:53.3756884Z scale_ub_tensor = None 2025-05-07T20:32:53.3757137Z 2025-05-07T20:32:53.3757372Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.3757746Z op = silu_mul_quant 2025-05-07T20:32:53.3757998Z if compiled: 2025-05-07T20:32:53.3758287Z op = torch.compile(op) 2025-05-07T20:32:53.3758586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.3758863Z 2025-05-07T20:32:53.3759061Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.3759501Z 2025-05-07T20:32:53.3759606Z moe/activation_test.py:117: 2025-05-07T20:32:53.3759905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.3760241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.3760613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.3761301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.3761986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.3762515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.3763196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.3763920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.3764447Z kernel = self.compile( 2025-05-07T20:32:53.3764978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.3765626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.3766022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.3766245Z 2025-05-07T20:32:53.3766457Z self = 2025-05-07T20:32:53.3767520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.3768883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca2520>} 2025-05-07T20:32:53.3770204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.3771213Z context = 2025-05-07T20:32:53.3771499Z 2025-05-07T20:32:53.3771669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.3772181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.3772684Z module_map=module_map) 2025-05-07T20:32:53.3773145Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.3773493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.3773758Z E ^ 2025-05-07T20:32:53.3774224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.3774666Z 2025-05-07T20:32:53.3775085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.3775587Z 2025-05-07T20:32:53.3775695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.3776103Z self=, 2025-05-07T20:32:53.3776507Z T=128, 2025-05-07T20:32:53.3776690Z D=5120, 2025-05-07T20:32:53.3776885Z scale_ub=None, 2025-05-07T20:32:53.3777102Z contiguous=True, 2025-05-07T20:32:53.3777324Z compiled=False, 2025-05-07T20:32:53.3777532Z ) 2025-05-07T20:32:53.4459924Z self = 2025-05-07T20:32:53.4460940Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.4461290Z 2025-05-07T20:32:53.4461480Z @given( 2025-05-07T20:32:53.4461729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.4462038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.4462350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.4462688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.4463011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.4463384Z ) 2025-05-07T20:32:53.4463739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.4464174Z def test_silu_mul_quant( 2025-05-07T20:32:53.4464419Z self, 2025-05-07T20:32:53.4464621Z T: int, 2025-05-07T20:32:53.4464818Z D: int, 2025-05-07T20:32:53.4465042Z scale_ub: Optional[float], 2025-05-07T20:32:53.4465315Z contiguous: bool, 2025-05-07T20:32:53.4465556Z compiled: bool, 2025-05-07T20:32:53.4465784Z ) -> None: 2025-05-07T20:32:53.4466005Z torch.manual_seed(2025) 2025-05-07T20:32:53.4466322Z 2025-05-07T20:32:53.4466595Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.4466938Z 2025-05-07T20:32:53.4467139Z x_sign = torch.sign(x) 2025-05-07T20:32:53.4467426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.4467741Z x = x_sign * x_clamp 2025-05-07T20:32:53.4467987Z x0 = x[:, :D] 2025-05-07T20:32:53.4468204Z x1 = x[:, D:] 2025-05-07T20:32:53.4468420Z 2025-05-07T20:32:53.4468607Z if contiguous: 2025-05-07T20:32:53.4468834Z x0 = x0.contiguous() 2025-05-07T20:32:53.4469096Z x1 = x1.contiguous() 2025-05-07T20:32:53.4469345Z 2025-05-07T20:32:53.4469534Z if scale_ub is not None: 2025-05-07T20:32:53.4469809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.4470148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.4470456Z ) 2025-05-07T20:32:53.4470659Z else: 2025-05-07T20:32:53.4470875Z scale_ub_tensor = None 2025-05-07T20:32:53.4471120Z 2025-05-07T20:32:53.4471354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.4471671Z op = silu_mul_quant 2025-05-07T20:32:53.4471926Z if compiled: 2025-05-07T20:32:53.4472172Z op = torch.compile(op) 2025-05-07T20:32:53.4472475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4472799Z 2025-05-07T20:32:53.4472988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.4473162Z 2025-05-07T20:32:53.4473263Z moe/activation_test.py:117: 2025-05-07T20:32:53.4473564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4473898Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.4474187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4474881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.4475574Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.4476114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.4476794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.4477460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.4477988Z kernel = self.compile( 2025-05-07T20:32:53.4478531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.4479182Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.4479588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4479863Z 2025-05-07T20:32:53.4480072Z self = 2025-05-07T20:32:53.4481185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.4482567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca3420>} 2025-05-07T20:32:53.4483977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.4484990Z context = 2025-05-07T20:32:53.4485275Z 2025-05-07T20:32:53.4485446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.4486020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.4486490Z module_map=module_map) 2025-05-07T20:32:53.4486851Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.4487215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.4487478Z E ^ 2025-05-07T20:32:53.4487943Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.4488388Z 2025-05-07T20:32:53.4488800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.4489309Z 2025-05-07T20:32:53.4489413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.4489829Z self=, 2025-05-07T20:32:53.4490228Z T=128, 2025-05-07T20:32:53.4490422Z D=7168, 2025-05-07T20:32:53.4490623Z scale_ub=None, 2025-05-07T20:32:53.4490841Z contiguous=True, 2025-05-07T20:32:53.4491068Z compiled=False, 2025-05-07T20:32:53.4491285Z ) 2025-05-07T20:32:53.4491611Z self = 2025-05-07T20:32:53.4492098Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.4492370Z 2025-05-07T20:32:53.4492448Z @given( 2025-05-07T20:32:53.4492682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.4493151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.4493459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.4493786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.4494109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.4494397Z ) 2025-05-07T20:32:53.4494746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.4495187Z def test_silu_mul_quant( 2025-05-07T20:32:53.4495422Z self, 2025-05-07T20:32:53.4495621Z T: int, 2025-05-07T20:32:53.4495825Z D: int, 2025-05-07T20:32:53.4496046Z scale_ub: Optional[float], 2025-05-07T20:32:53.4496322Z contiguous: bool, 2025-05-07T20:32:53.4496565Z compiled: bool, 2025-05-07T20:32:53.4496788Z ) -> None: 2025-05-07T20:32:53.4497006Z torch.manual_seed(2025) 2025-05-07T20:32:53.4497252Z 2025-05-07T20:32:53.4497525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.4497872Z 2025-05-07T20:32:53.4498074Z x_sign = torch.sign(x) 2025-05-07T20:32:53.4498365Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.4498682Z x = x_sign * x_clamp 2025-05-07T20:32:53.4498925Z x0 = x[:, :D] 2025-05-07T20:32:53.4499139Z x1 = x[:, D:] 2025-05-07T20:32:53.4499407Z 2025-05-07T20:32:53.4499602Z if contiguous: 2025-05-07T20:32:53.4499842Z x0 = x0.contiguous() 2025-05-07T20:32:53.4500140Z x1 = x1.contiguous() 2025-05-07T20:32:53.4500393Z 2025-05-07T20:32:53.4500591Z if scale_ub is not None: 2025-05-07T20:32:53.4500863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.4501205Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.4501513Z ) 2025-05-07T20:32:53.4501701Z else: 2025-05-07T20:32:53.4501913Z scale_ub_tensor = None 2025-05-07T20:32:53.4502215Z 2025-05-07T20:32:53.4502451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.4502813Z op = silu_mul_quant 2025-05-07T20:32:53.4503071Z if compiled: 2025-05-07T20:32:53.4503317Z op = torch.compile(op) 2025-05-07T20:32:53.4503614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4503892Z 2025-05-07T20:32:53.4504086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.4504257Z 2025-05-07T20:32:53.4504357Z moe/activation_test.py:117: 2025-05-07T20:32:53.4504745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4505078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.4505352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4506036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.4506715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.4507245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.4507920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.4508577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.4509108Z kernel = self.compile( 2025-05-07T20:32:53.4509642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.4510294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.4510689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4510916Z 2025-05-07T20:32:53.4511129Z self = 2025-05-07T20:32:53.4512194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.4513602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea944a0>} 2025-05-07T20:32:53.4522058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.4523143Z context = 2025-05-07T20:32:53.4523435Z 2025-05-07T20:32:53.4523606Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.4524124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.4524592Z module_map=module_map) 2025-05-07T20:32:53.4524964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.4525315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.4525578Z E ^ 2025-05-07T20:32:53.4526042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.4526486Z 2025-05-07T20:32:53.4526989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.4527502Z 2025-05-07T20:32:53.4527650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.4528069Z self=, 2025-05-07T20:32:53.4528471Z T=2048, 2025-05-07T20:32:53.4528658Z D=7168, 2025-05-07T20:32:53.4528856Z scale_ub=1200.0, 2025-05-07T20:32:53.4529081Z contiguous=True, 2025-05-07T20:32:53.4529303Z compiled=False, 2025-05-07T20:32:53.4529522Z ) 2025-05-07T20:32:53.5340759Z self = 2025-05-07T20:32:53.5341546Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5341916Z 2025-05-07T20:32:53.5342035Z @given( 2025-05-07T20:32:53.5342300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5342629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5342956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5343282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5343877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5344171Z ) 2025-05-07T20:32:53.5344523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5344959Z def test_silu_mul_quant( 2025-05-07T20:32:53.5345206Z self, 2025-05-07T20:32:53.5345407Z T: int, 2025-05-07T20:32:53.5345603Z D: int, 2025-05-07T20:32:53.5345831Z scale_ub: Optional[float], 2025-05-07T20:32:53.5346108Z contiguous: bool, 2025-05-07T20:32:53.5346343Z compiled: bool, 2025-05-07T20:32:53.5346571Z ) -> None: 2025-05-07T20:32:53.5346786Z torch.manual_seed(2025) 2025-05-07T20:32:53.5347024Z 2025-05-07T20:32:53.5347303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5349357Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.5351203Z 2025-05-07T20:32:53.5351323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.5351535Z 2025-05-07T20:32:53.5351647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5352060Z self=, 2025-05-07T20:32:53.5352466Z T=1, 2025-05-07T20:32:53.5352684Z D=5120, 2025-05-07T20:32:53.5352904Z scale_ub=1200.0, 2025-05-07T20:32:53.5353139Z contiguous=True, 2025-05-07T20:32:53.5353372Z compiled=False, 2025-05-07T20:32:53.5353576Z ) 2025-05-07T20:32:53.5353902Z self = 2025-05-07T20:32:53.5354393Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5354660Z 2025-05-07T20:32:53.5354744Z @given( 2025-05-07T20:32:53.5354972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5355289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5355599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5355935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5356273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5356568Z ) 2025-05-07T20:32:53.5356915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5357358Z def test_silu_mul_quant( 2025-05-07T20:32:53.5357692Z self, 2025-05-07T20:32:53.5357888Z T: int, 2025-05-07T20:32:53.5358099Z D: int, 2025-05-07T20:32:53.5358404Z scale_ub: Optional[float], 2025-05-07T20:32:53.5358681Z contiguous: bool, 2025-05-07T20:32:53.5358933Z compiled: bool, 2025-05-07T20:32:53.5359162Z ) -> None: 2025-05-07T20:32:53.5359664Z torch.manual_seed(2025) 2025-05-07T20:32:53.5359904Z 2025-05-07T20:32:53.5360177Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5360519Z 2025-05-07T20:32:53.5360711Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5361092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5361405Z x = x_sign * x_clamp 2025-05-07T20:32:53.5361642Z x0 = x[:, :D] 2025-05-07T20:32:53.5361873Z x1 = x[:, D:] 2025-05-07T20:32:53.5362088Z 2025-05-07T20:32:53.5362271Z if contiguous: 2025-05-07T20:32:53.5362509Z x0 = x0.contiguous() 2025-05-07T20:32:53.5362786Z x1 = x1.contiguous() 2025-05-07T20:32:53.5363069Z 2025-05-07T20:32:53.5363265Z if scale_ub is not None: 2025-05-07T20:32:53.5363611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5363943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5364257Z ) 2025-05-07T20:32:53.5364464Z else: 2025-05-07T20:32:53.5364681Z scale_ub_tensor = None 2025-05-07T20:32:53.5364945Z 2025-05-07T20:32:53.5365187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5365509Z op = silu_mul_quant 2025-05-07T20:32:53.5365753Z if compiled: 2025-05-07T20:32:53.5366010Z op = torch.compile(op) 2025-05-07T20:32:53.5366316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5366597Z 2025-05-07T20:32:53.5366801Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.5366967Z 2025-05-07T20:32:53.5367072Z moe/activation_test.py:117: 2025-05-07T20:32:53.5367371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5367709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.5368003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5368695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.5369379Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.5369918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5370610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5371264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5371812Z kernel = self.compile( 2025-05-07T20:32:53.5372370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5373107Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5373508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5373750Z 2025-05-07T20:32:53.5373961Z self = 2025-05-07T20:32:53.5375047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5376417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea95a80>} 2025-05-07T20:32:53.5377752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5378903Z context = 2025-05-07T20:32:53.5379198Z 2025-05-07T20:32:53.5379370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5379896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5380362Z module_map=module_map) 2025-05-07T20:32:53.5380742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5381149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5381430Z E ^ 2025-05-07T20:32:53.5381887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5382347Z 2025-05-07T20:32:53.5382811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5383322Z 2025-05-07T20:32:53.5383436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5383885Z self=, 2025-05-07T20:32:53.5384291Z T=2048, 2025-05-07T20:32:53.5384488Z D=5120, 2025-05-07T20:32:53.5384684Z scale_ub=None, 2025-05-07T20:32:53.5384898Z contiguous=True, 2025-05-07T20:32:53.5385131Z compiled=False, 2025-05-07T20:32:53.5385340Z ) 2025-05-07T20:32:53.5385661Z self = 2025-05-07T20:32:53.5386154Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.5386427Z 2025-05-07T20:32:53.5386515Z @given( 2025-05-07T20:32:53.5386747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5387075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5387393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5387728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5388077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5388374Z ) 2025-05-07T20:32:53.5388741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5389186Z def test_silu_mul_quant( 2025-05-07T20:32:53.5389435Z self, 2025-05-07T20:32:53.5389647Z T: int, 2025-05-07T20:32:53.5389845Z D: int, 2025-05-07T20:32:53.5390076Z scale_ub: Optional[float], 2025-05-07T20:32:53.5390359Z contiguous: bool, 2025-05-07T20:32:53.5390601Z compiled: bool, 2025-05-07T20:32:53.5390841Z ) -> None: 2025-05-07T20:32:53.5391070Z torch.manual_seed(2025) 2025-05-07T20:32:53.5391316Z 2025-05-07T20:32:53.5391602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5391956Z 2025-05-07T20:32:53.5392155Z > x_sign = torch.sign(x) 2025-05-07T20:32:53.5394146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.5396020Z 2025-05-07T20:32:53.5396142Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.5396363Z 2025-05-07T20:32:53.5396500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5397072Z self=, 2025-05-07T20:32:53.5397604Z T=16384, 2025-05-07T20:32:53.5397803Z D=5120, 2025-05-07T20:32:53.5398000Z scale_ub=None, 2025-05-07T20:32:53.5398209Z contiguous=True, 2025-05-07T20:32:53.5398497Z compiled=False, 2025-05-07T20:32:53.5398698Z ) 2025-05-07T20:32:53.6160714Z self = 2025-05-07T20:32:53.6161870Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.6162427Z 2025-05-07T20:32:53.6162596Z @given( 2025-05-07T20:32:53.6162848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6163162Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6163470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6163890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6164218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6164509Z ) 2025-05-07T20:32:53.6164861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6165309Z def test_silu_mul_quant( 2025-05-07T20:32:53.6165569Z self, 2025-05-07T20:32:53.6165771Z T: int, 2025-05-07T20:32:53.6165975Z D: int, 2025-05-07T20:32:53.6166197Z scale_ub: Optional[float], 2025-05-07T20:32:53.6166469Z contiguous: bool, 2025-05-07T20:32:53.6166786Z compiled: bool, 2025-05-07T20:32:53.6167020Z ) -> None: 2025-05-07T20:32:53.6167239Z torch.manual_seed(2025) 2025-05-07T20:32:53.6167482Z 2025-05-07T20:32:53.6167763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6169825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6171690Z 2025-05-07T20:32:53.6171812Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6172025Z 2025-05-07T20:32:53.6172134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6172548Z self=, 2025-05-07T20:32:53.6173094Z T=4096, 2025-05-07T20:32:53.6173287Z D=5120, 2025-05-07T20:32:53.6173476Z scale_ub=None, 2025-05-07T20:32:53.6173697Z contiguous=True, 2025-05-07T20:32:53.6173922Z compiled=False, 2025-05-07T20:32:53.6174125Z ) 2025-05-07T20:32:53.6174442Z self = 2025-05-07T20:32:53.6174928Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.6175195Z 2025-05-07T20:32:53.6175282Z @given( 2025-05-07T20:32:53.6175509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6175825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6176128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6176451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6176778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6177067Z ) 2025-05-07T20:32:53.6177414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6177852Z def test_silu_mul_quant( 2025-05-07T20:32:53.6178091Z self, 2025-05-07T20:32:53.6178297Z T: int, 2025-05-07T20:32:53.6178500Z D: int, 2025-05-07T20:32:53.6178719Z scale_ub: Optional[float], 2025-05-07T20:32:53.6178990Z contiguous: bool, 2025-05-07T20:32:53.6179226Z compiled: bool, 2025-05-07T20:32:53.6179448Z ) -> None: 2025-05-07T20:32:53.6179668Z torch.manual_seed(2025) 2025-05-07T20:32:53.6179914Z 2025-05-07T20:32:53.6180188Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6182324Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
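Note how the reported "memory in use" creeps from 21.92 GiB to 22.04 GiB across successive examples: Hypothesis runs all of its examples inside a single test invocation, and tensors from a failed example can remain reachable through the captured traceback when the next example starts. A defensive per-example cleanup, sketched here as an assumption rather than anything the original test does:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached
        # allocator blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

Calling this at the top of the test body, before the torch.randn allocation, would keep one example's leftovers from starving the next.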
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6184185Z 2025-05-07T20:32:53.6184311Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6184524Z 2025-05-07T20:32:53.6184636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6185043Z self=, 2025-05-07T20:32:53.6185445Z T=2048, 2025-05-07T20:32:53.6185634Z D=5120, 2025-05-07T20:32:53.6185823Z scale_ub=None, 2025-05-07T20:32:53.6186041Z contiguous=False, 2025-05-07T20:32:53.6186274Z compiled=False, 2025-05-07T20:32:53.6186476Z ) 2025-05-07T20:32:53.6186839Z self = 2025-05-07T20:32:53.6187337Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.6187608Z 2025-05-07T20:32:53.6187687Z @given( 2025-05-07T20:32:53.6187921Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6188233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6188544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6188869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6189201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6189500Z ) 2025-05-07T20:32:53.6189848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6190295Z def test_silu_mul_quant( 2025-05-07T20:32:53.6190550Z self, 2025-05-07T20:32:53.6190743Z T: int, 2025-05-07T20:32:53.6190947Z D: int, 2025-05-07T20:32:53.6191172Z scale_ub: Optional[float], 2025-05-07T20:32:53.6191443Z contiguous: bool, 2025-05-07T20:32:53.6191687Z compiled: bool, 2025-05-07T20:32:53.6191913Z ) -> None: 2025-05-07T20:32:53.6192126Z torch.manual_seed(2025) 2025-05-07T20:32:53.6192366Z 2025-05-07T20:32:53.6192637Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6194641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6196474Z 2025-05-07T20:32:53.6196605Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6196814Z 2025-05-07T20:32:53.6196916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6197333Z self=, 2025-05-07T20:32:53.6197730Z T=4096, 2025-05-07T20:32:53.6197913Z D=7168, 2025-05-07T20:32:53.6198112Z scale_ub=None, 2025-05-07T20:32:53.6198334Z contiguous=True, 2025-05-07T20:32:53.6198551Z compiled=True, 2025-05-07T20:32:53.6198763Z ) 2025-05-07T20:32:53.6199083Z self = 2025-05-07T20:32:53.6199571Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.6199836Z 2025-05-07T20:32:53.6199916Z @given( 2025-05-07T20:32:53.6200202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6200516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6200897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6201233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6201561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6201843Z ) 2025-05-07T20:32:53.6202186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6202620Z def test_silu_mul_quant( 2025-05-07T20:32:53.6202857Z self, 2025-05-07T20:32:53.6203092Z T: int, 2025-05-07T20:32:53.6203294Z D: int, 2025-05-07T20:32:53.6203518Z scale_ub: Optional[float], 2025-05-07T20:32:53.6203787Z contiguous: bool, 2025-05-07T20:32:53.6204037Z compiled: bool, 2025-05-07T20:32:53.6204262Z ) -> None: 2025-05-07T20:32:53.6204477Z torch.manual_seed(2025) 2025-05-07T20:32:53.6204725Z 2025-05-07T20:32:53.6205009Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6207058Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6208883Z 2025-05-07T20:32:53.6209006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6209224Z 2025-05-07T20:32:53.6209328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6209754Z self=, 2025-05-07T20:32:53.6210162Z T=2048, 2025-05-07T20:32:53.6210351Z D=5120, 2025-05-07T20:32:53.6210556Z scale_ub=1200.0, 2025-05-07T20:32:53.6210791Z contiguous=False, 2025-05-07T20:32:53.6211025Z compiled=False, 2025-05-07T20:32:53.6211235Z ) 2025-05-07T20:32:53.6211571Z self = 2025-05-07T20:32:53.6212059Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.6212339Z 2025-05-07T20:32:53.6212424Z @given( 2025-05-07T20:32:53.6212655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6213026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6213328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6213655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6213985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6214264Z ) 2025-05-07T20:32:53.6214616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6215056Z def test_silu_mul_quant( 2025-05-07T20:32:53.6215289Z self, 2025-05-07T20:32:53.6215497Z T: int, 2025-05-07T20:32:53.6215698Z D: int, 2025-05-07T20:32:53.6215914Z scale_ub: Optional[float], 2025-05-07T20:32:53.6216187Z contiguous: bool, 2025-05-07T20:32:53.6216434Z compiled: bool, 2025-05-07T20:32:53.6216653Z ) -> None: 2025-05-07T20:32:53.6216870Z torch.manual_seed(2025) 2025-05-07T20:32:53.6217116Z 2025-05-07T20:32:53.6217391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6219445Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6221298Z 2025-05-07T20:32:53.6221419Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6221638Z 2025-05-07T20:32:53.6221746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6222159Z self=, 2025-05-07T20:32:53.6222561Z T=4096, 2025-05-07T20:32:53.6222798Z D=7168, 2025-05-07T20:32:53.6222993Z scale_ub=1200.0, 2025-05-07T20:32:53.6223212Z contiguous=True, 2025-05-07T20:32:53.6223437Z compiled=False, 2025-05-07T20:32:53.6223655Z ) 2025-05-07T20:32:53.7302495Z self = 2025-05-07T20:32:53.7303248Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.7303547Z 2025-05-07T20:32:53.7303639Z @given( 2025-05-07T20:32:53.7303876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.7304464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.7304791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.7305123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.7305464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.7305767Z ) 2025-05-07T20:32:53.7306123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.7306594Z def test_silu_mul_quant( 2025-05-07T20:32:53.7306858Z self, 2025-05-07T20:32:53.7307065Z T: int, 2025-05-07T20:32:53.7307282Z D: int, 2025-05-07T20:32:53.7307511Z scale_ub: Optional[float], 2025-05-07T20:32:53.7307793Z contiguous: bool, 2025-05-07T20:32:53.7308048Z compiled: bool, 2025-05-07T20:32:53.7308283Z ) -> None: 2025-05-07T20:32:53.7308514Z torch.manual_seed(2025) 2025-05-07T20:32:53.7308767Z 2025-05-07T20:32:53.7309054Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.7311109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next four generated examples fail identically, at the same allocation in moe/activation_test.py:92, with the full test body reprinted each time; condensed:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False): tried to allocate 448.00 MiB

In each case: torch.OutOfMemoryError: CUDA out of memory. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.7371982Z 2025-05-07T20:32:53.7372103Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.7372315Z 2025-05-07T20:32:53.7372436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.7372928Z self=, 2025-05-07T20:32:53.7373414Z T=128, 2025-05-07T20:32:53.7373609Z D=5120, 2025-05-07T20:32:53.7373806Z scale_ub=1200.0, 2025-05-07T20:32:53.7374034Z contiguous=False, 2025-05-07T20:32:53.7374270Z compiled=False, 2025-05-07T20:32:53.7374481Z ) 2025-05-07T20:32:53.8662879Z self = 2025-05-07T20:32:53.8663473Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.8663759Z 2025-05-07T20:32:53.8663840Z @given( 2025-05-07T20:32:53.8664084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8664398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8664716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8665060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8665392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8665695Z ) 2025-05-07T20:32:53.8666060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8666514Z def test_silu_mul_quant( 2025-05-07T20:32:53.8666756Z self, 2025-05-07T20:32:53.8666964Z T: int, 2025-05-07T20:32:53.8667165Z D: int, 2025-05-07T20:32:53.8667388Z scale_ub: Optional[float], 2025-05-07T20:32:53.8667662Z contiguous: bool, 2025-05-07T20:32:53.8667906Z compiled: bool, 2025-05-07T20:32:53.8668129Z ) -> None: 2025-05-07T20:32:53.8668347Z torch.manual_seed(2025) 2025-05-07T20:32:53.8668592Z 2025-05-07T20:32:53.8668863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8669213Z 2025-05-07T20:32:53.8669415Z x_sign = torch.sign(x) 2025-05-07T20:32:53.8669705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.8670019Z x = x_sign * x_clamp 2025-05-07T20:32:53.8670267Z x0 = x[:, :D] 2025-05-07T20:32:53.8670489Z x1 = x[:, D:] 2025-05-07T20:32:53.8670707Z 2025-05-07T20:32:53.8670899Z if contiguous: 2025-05-07T20:32:53.8671137Z x0 = x0.contiguous() 2025-05-07T20:32:53.8671404Z x1 = x1.contiguous() 2025-05-07T20:32:53.8671658Z 2025-05-07T20:32:53.8671861Z if scale_ub is not None: 2025-05-07T20:32:53.8672135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.8672487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.8672805Z ) 2025-05-07T20:32:53.8673000Z else: 2025-05-07T20:32:53.8673224Z scale_ub_tensor = None 2025-05-07T20:32:53.8673485Z 2025-05-07T20:32:53.8673719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.8674041Z op = silu_mul_quant 2025-05-07T20:32:53.8674575Z if compiled: 2025-05-07T20:32:53.8674829Z op = torch.compile(op) 2025-05-07T20:32:53.8675214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8675502Z 2025-05-07T20:32:53.8675699Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.8675863Z 2025-05-07T20:32:53.8675965Z moe/activation_test.py:117: 2025-05-07T20:32:53.8676265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8676597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.8676873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8677646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.8678331Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.8678875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.8679548Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.8680283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.8680817Z kernel = self.compile( 2025-05-07T20:32:53.8681364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.8682014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8682413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8682641Z 2025-05-07T20:32:53.8682859Z self = 2025-05-07T20:32:53.8683936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.8685305Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e8247c0>} 2025-05-07T20:32:53.8686636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.8687649Z context = 2025-05-07T20:32:53.8687935Z 2025-05-07T20:32:53.8688108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.8688623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8689093Z module_map=module_map) 2025-05-07T20:32:53.8689459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8689812Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8690070Z E ^ 2025-05-07T20:32:53.8690536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.8690989Z 2025-05-07T20:32:53.8691407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.8691912Z 2025-05-07T20:32:53.8692020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8692439Z self=, 2025-05-07T20:32:53.8692865Z T=2048, 2025-05-07T20:32:53.8693220Z D=7168, 2025-05-07T20:32:53.8693413Z scale_ub=None, 2025-05-07T20:32:53.8693633Z contiguous=False, 2025-05-07T20:32:53.8693859Z compiled=False, 2025-05-07T20:32:53.8694070Z ) 2025-05-07T20:32:53.8694395Z self = 2025-05-07T20:32:53.8694890Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.8695242Z 2025-05-07T20:32:53.8695321Z @given( 2025-05-07T20:32:53.8695599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8695920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8696226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8696562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8696903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8697191Z ) 2025-05-07T20:32:53.8697547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8698039Z def test_silu_mul_quant( 2025-05-07T20:32:53.8698291Z self, 2025-05-07T20:32:53.8698498Z T: int, 2025-05-07T20:32:53.8698705Z D: int, 2025-05-07T20:32:53.8698936Z scale_ub: Optional[float], 2025-05-07T20:32:53.8699205Z contiguous: bool, 2025-05-07T20:32:53.8699452Z compiled: bool, 2025-05-07T20:32:53.8699684Z ) -> None: 2025-05-07T20:32:53.8699901Z torch.manual_seed(2025) 2025-05-07T20:32:53.8700150Z 2025-05-07T20:32:53.8700477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8702514Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
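The CompilationError above is the second, distinct failure mode: Triton's fp8e4nv is the NVIDIA FP8 E4M3 encoding (the one torch.float8_e4m3fn corresponds to), and Triton only emits it on GPUs with compute capability 8.9 or newer (Ada/Hopper). The 22.07 GiB capacity reported above is consistent with a 24 GB A10G, which is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are available and the kernel cannot be compiled at all. A sketch of one possible capability gate (names are illustrative; the suite may guard differently):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs SM 8.9+; the A10G is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...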
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8704402Z 2025-05-07T20:32:53.8704524Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8704748Z 2025-05-07T20:32:53.8704853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8705273Z self=, 2025-05-07T20:32:53.8705681Z T=128, 2025-05-07T20:32:53.8705876Z D=7168, 2025-05-07T20:32:53.8706084Z scale_ub=1200.0, 2025-05-07T20:32:53.8706319Z contiguous=True, 2025-05-07T20:32:53.8706548Z compiled=True, 2025-05-07T20:32:53.8706764Z ) 2025-05-07T20:32:53.9019956Z self = 2025-05-07T20:32:53.9020616Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.9020885Z 2025-05-07T20:32:53.9020987Z @given( 2025-05-07T20:32:53.9021216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9021530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9021843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9022168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9022507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9022841Z ) 2025-05-07T20:32:53.9023222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9023679Z def test_silu_mul_quant( 2025-05-07T20:32:53.9023938Z self, 2025-05-07T20:32:53.9024145Z T: int, 2025-05-07T20:32:53.9024345Z D: int, 2025-05-07T20:32:53.9024575Z scale_ub: Optional[float], 2025-05-07T20:32:53.9024861Z contiguous: bool, 2025-05-07T20:32:53.9025107Z compiled: bool, 2025-05-07T20:32:53.9025345Z ) -> None: 2025-05-07T20:32:53.9025575Z torch.manual_seed(2025) 2025-05-07T20:32:53.9025820Z 2025-05-07T20:32:53.9026101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9026449Z 2025-05-07T20:32:53.9026645Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9026941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9027268Z x = x_sign * x_clamp 2025-05-07T20:32:53.9027682Z x0 = x[:, :D] 2025-05-07T20:32:53.9027914Z x1 = x[:, D:] 2025-05-07T20:32:53.9028132Z 2025-05-07T20:32:53.9028317Z if contiguous: 2025-05-07T20:32:53.9028634Z x0 = x0.contiguous() 2025-05-07T20:32:53.9028900Z x1 = x1.contiguous() 2025-05-07T20:32:53.9029141Z 2025-05-07T20:32:53.9029333Z if scale_ub is not None: 2025-05-07T20:32:53.9029612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.9029949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.9030259Z ) 2025-05-07T20:32:53.9030529Z else: 2025-05-07T20:32:53.9030749Z scale_ub_tensor = None 2025-05-07T20:32:53.9031001Z 2025-05-07T20:32:53.9031241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.9031556Z op = silu_mul_quant 2025-05-07T20:32:53.9031808Z if compiled: 2025-05-07T20:32:53.9032068Z op = torch.compile(op) 2025-05-07T20:32:53.9032375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.9032648Z 2025-05-07T20:32:53.9032857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.9033030Z 2025-05-07T20:32:53.9033200Z moe/activation_test.py:117: 2025-05-07T20:32:53.9033502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.9033833Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.9034120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.9034684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.9035243Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.9035902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.9036585Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.9037123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.9037798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.9038472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.9039005Z kernel = self.compile( 2025-05-07T20:32:53.9039548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.9040203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.9040607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.9040837Z 2025-05-07T20:32:53.9041053Z self = 2025-05-07T20:32:53.9042118Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.9043488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e825940>} 2025-05-07T20:32:53.9044811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.9045823Z context = 2025-05-07T20:32:53.9046111Z 2025-05-07T20:32:53.9046285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.9046803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.9047274Z module_map=module_map) 2025-05-07T20:32:53.9047648Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.9048051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.9048316Z E ^ 2025-05-07T20:32:53.9048826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.9049272Z 2025-05-07T20:32:53.9049693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.9050196Z 2025-05-07T20:32:53.9050301Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9050716Z self=, 2025-05-07T20:32:53.9051156Z T=128, 2025-05-07T20:32:53.9051343Z D=7168, 2025-05-07T20:32:53.9051546Z scale_ub=1200.0, 2025-05-07T20:32:53.9051774Z contiguous=True, 2025-05-07T20:32:53.9051995Z compiled=False, 2025-05-07T20:32:53.9052209Z ) 2025-05-07T20:32:53.9052534Z self = 2025-05-07T20:32:53.9053153Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.9053420Z 2025-05-07T20:32:53.9053498Z @given( 2025-05-07T20:32:53.9053779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9054099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9054403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9054735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9055069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9055355Z ) 2025-05-07T20:32:53.9055710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9056154Z def test_silu_mul_quant( 2025-05-07T20:32:53.9056401Z self, 2025-05-07T20:32:53.9056594Z T: int, 2025-05-07T20:32:53.9056801Z D: int, 2025-05-07T20:32:53.9057026Z scale_ub: Optional[float], 2025-05-07T20:32:53.9057296Z contiguous: bool, 2025-05-07T20:32:53.9057545Z compiled: bool, 2025-05-07T20:32:53.9057769Z ) -> None: 2025-05-07T20:32:53.9057984Z torch.manual_seed(2025) 2025-05-07T20:32:53.9058233Z 2025-05-07T20:32:53.9058516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9058859Z 2025-05-07T20:32:53.9059065Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9059684Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9061681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
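Note how the headroom shrinks as Hypothesis iterates: earlier examples fail with 26.44 MiB free, the ones below with 4.44 MiB, even though the T=128 examples need only tens of MiB. That points to allocations accumulating across generated examples rather than any single example being too large. One mitigation, sketched here as an assumption rather than the suite's actual fix, is to return cached blocks to the driver at the start of each example:

    import gc

    import torch

    def free_cuda() -> None:
        # Drop dead references, then release cached allocator blocks.
        gc.collect()
        torch.cuda.empty_cache()

Calling such a helper at the top of the test body runs once per generated example; unittest's setUp runs only once per test method, not per Hypothesis example, which is why per-example cleanup has to live inside the test (Hypothesis also provides setup_example/teardown_example hooks on TestCase subclasses for this purpose).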
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.9063512Z 2025-05-07T20:32:53.9063644Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.9063858Z 2025-05-07T20:32:53.9063973Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9064393Z self=, 2025-05-07T20:32:53.9064798Z T=128, 2025-05-07T20:32:53.9064984Z D=5120, 2025-05-07T20:32:53.9065183Z scale_ub=1200.0, 2025-05-07T20:32:53.9065415Z contiguous=True, 2025-05-07T20:32:53.9065640Z compiled=True, 2025-05-07T20:32:53.9065857Z ) 2025-05-07T20:32:53.9066179Z self = 2025-05-07T20:32:53.9066663Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.9066933Z 2025-05-07T20:32:53.9067011Z @given( 2025-05-07T20:32:53.9067242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9067558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9067954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9068354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9068690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9068973Z ) 2025-05-07T20:32:53.9069325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9069768Z def test_silu_mul_quant( 2025-05-07T20:32:53.9070012Z self, 2025-05-07T20:32:53.9070210Z T: int, 2025-05-07T20:32:53.9070409Z D: int, 2025-05-07T20:32:53.9070722Z scale_ub: Optional[float], 2025-05-07T20:32:53.9070998Z contiguous: bool, 2025-05-07T20:32:53.9071244Z compiled: bool, 2025-05-07T20:32:53.9071469Z ) -> None: 2025-05-07T20:32:53.9071692Z torch.manual_seed(2025) 2025-05-07T20:32:53.9071937Z 2025-05-07T20:32:53.9072212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9072554Z 2025-05-07T20:32:53.9072756Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9073055Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9075111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.9076936Z 2025-05-07T20:32:53.9077059Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.9077278Z 2025-05-07T20:32:53.9077382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9077794Z self=, 2025-05-07T20:32:53.9078198Z T=128, 2025-05-07T20:32:53.9078384Z D=7168, 2025-05-07T20:32:53.9078587Z scale_ub=None, 2025-05-07T20:32:53.9078810Z contiguous=True, 2025-05-07T20:32:53.9079031Z compiled=True, 2025-05-07T20:32:53.9079240Z ) 2025-05-07T20:32:54.1574812Z self = 2025-05-07T20:32:54.1575360Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1575629Z 2025-05-07T20:32:54.1575722Z @given( 2025-05-07T20:32:54.1575977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1576300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1576612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1576940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1577270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1577559Z ) 2025-05-07T20:32:54.1577918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1578369Z def test_silu_mul_quant( 2025-05-07T20:32:54.1578635Z self, 2025-05-07T20:32:54.1578838Z T: int, 2025-05-07T20:32:54.1579034Z D: int, 2025-05-07T20:32:54.1579259Z scale_ub: Optional[float], 2025-05-07T20:32:54.1579539Z contiguous: bool, 2025-05-07T20:32:54.1579781Z compiled: bool, 2025-05-07T20:32:54.1580023Z ) -> None: 2025-05-07T20:32:54.1580247Z torch.manual_seed(2025) 2025-05-07T20:32:54.1580498Z 2025-05-07T20:32:54.1580788Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1583111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
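For orientation before the failure summary below: the test checks silu_mul_quant against a float32 reference that computes a SiLU-gated multiply, y = x0 * sigmoid(x0) * x1, then quantizes each row to FP8 with a per-row scale, so that y_fp8.to(torch.float32) * y_scale[:, None] recovers y. A minimal pure-PyTorch sketch of that contract (assuming torch.float8_e4m3fn as the FP8 dtype and ignoring the optional scale_ub clamp; the real kernels are Triton):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in float32, mirroring the test's ref_fn.
        x0f, x1f = x0.to(torch.float32), x1.to(torch.float32)
        return x0f * torch.sigmoid(x0f) * x1f

    def quantize_fp8_row(y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to the FP8 max (448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = y.abs().amax(dim=1).clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    y = silu_mul_ref(torch.randn(4, 8), torch.randn(4, 8))
    y_fp8, y_scale = quantize_fp8_row(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # approximates y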
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1585131Z 2025-05-07T20:32:54.1585256Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.1585480Z 2025-05-07T20:32:54.1600295Z FAILED 2025-05-07T20:32:54.1600539Z 2025-05-07T20:32:54.1600780Z =================================== FAILURES =================================== 2025-05-07T20:32:54.1601625Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:54.1602247Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:54.1603031Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:54.1603585Z | yield 2025-05-07T20:32:54.1604130Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:54.1604857Z | self._callTestMethod(testMethod) 2025-05-07T20:32:54.1605868Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:54.1606697Z | if method() is not None: 2025-05-07T20:32:54.1607043Z | ^^^^^^^^ 2025-05-07T20:32:54.1607982Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:54.1609267Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1609694Z | ^^^^^^^ 2025-05-07T20:32:54.1610512Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:54.1611424Z | raise the_error_hypothesis_found 2025-05-07T20:32:54.1612029Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:54.1612649Z +-+---------------- 1 ---------------- 2025-05-07T20:32:54.1613253Z | Traceback (most recent call last): 2025-05-07T20:32:54.1614284Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1615413Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1615954Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1618867Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
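Hypothesis reports the four distinct failures as a single PEP 654 ExceptionGroup (raised from hypothesis/core.py above); on Python 3.11+ the sub-exceptions can be split by type with except*. A self-contained sketch, where run_test is a hypothetical stand-in for the failing suite:

    import torch

    def run_test() -> None:
        # Stand-in that fails the way this log does: a group mixing OOMs
        # with Triton compilation failures (ValueError here stands in for
        # triton.compiler.errors.CompilationError).
        raise ExceptionGroup(
            "Hypothesis found 2 distinct failures",
            [
                torch.OutOfMemoryError("CUDA out of memory"),
                ValueError("type fp8e4nv not supported in this architecture"),
            ],
        )

    try:
        run_test()
    except* torch.OutOfMemoryError as eg:
        for exc in eg.exceptions:
            print("OOM sub-failure:", exc)
    except* ValueError as eg:
        print(len(eg.exceptions), "compilation-style sub-failure(s)")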
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1621791Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1622427Z | self=, 2025-05-07T20:32:54.1623022Z | T=2048, 2025-05-07T20:32:54.1623363Z | D=5120, # or any other generated value 2025-05-07T20:32:54.1623851Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.1624383Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.1624915Z | compiled=False, # or any other generated value 2025-05-07T20:32:54.1625352Z | ) 2025-05-07T20:32:54.1625599Z | 2025-05-07T20:32:54.1626364Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:54.1627261Z +---------------- 2 ---------------- 2025-05-07T20:32:54.1627754Z | Traceback (most recent call last): 2025-05-07T20:32:54.1628841Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1629986Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1630533Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1633874Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1637688Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1638544Z | self=, 2025-05-07T20:32:54.1639259Z | T=128, 2025-05-07T20:32:54.1639603Z | D=7168, 2025-05-07T20:32:54.1639959Z | scale_ub=None, 2025-05-07T20:32:54.1640361Z | contiguous=True, 2025-05-07T20:32:54.1640763Z | compiled=True, 2025-05-07T20:32:54.1641135Z | ) 2025-05-07T20:32:54.1641449Z | 2025-05-07T20:32:54.1642382Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1643504Z +---------------- 3 ---------------- 2025-05-07T20:32:54.1643980Z | Traceback (most recent call last): 2025-05-07T20:32:54.1645125Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1646411Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1647004Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1650120Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
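Each sub-failure ends with a @reproduce_failure line; temporarily copying it onto the test replays exactly that falsifying example instead of re-running the whole search. For the first failure above it would look like the following sketch (version string and blob verbatim from this log; the body is elided):

    from typing import Optional

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # same body as shown in the log

The blob is tied to Hypothesis 6.131.14, and the decorator is meant to be removed once the failure is fixed.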
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1653183Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1653828Z | self=, 2025-05-07T20:32:54.1654418Z | T=128, 2025-05-07T20:32:54.1654709Z | D=5120, 2025-05-07T20:32:54.1655019Z | scale_ub=1200.0, 2025-05-07T20:32:54.1655294Z | contiguous=True, 2025-05-07T20:32:54.1655551Z | compiled=True, 2025-05-07T20:32:54.1655787Z | ) 2025-05-07T20:32:54.1655982Z | 2025-05-07T20:32:54.1656516Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1657118Z +---------------- 4 ---------------- 2025-05-07T20:32:54.1657421Z | Traceback (most recent call last): 2025-05-07T20:32:54.1658140Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:54.1658849Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1659142Z | ^^^^^^^^ 2025-05-07T20:32:54.1660263Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:54.1660957Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1661358Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1662227Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:54.1663007Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1663692Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:54.1664406Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1664842Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1665477Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:54.1666299Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1666754Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1667380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:54.1668064Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1668430Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1669019Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:54.1669574Z | fn() 2025-05-07T20:32:54.1670133Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:54.1670747Z | self.fn.run( 2025-05-07T20:32:54.1671270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:54.1671837Z | kernel = self.compile( 2025-05-07T20:32:54.1672091Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:54.1672681Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:54.1673426Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1673819Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1674443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.1675219Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1675695Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1676067Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1676418Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1676680Z | ^ 2025-05-07T20:32:54.1677133Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1677685Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1678086Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:54.1678592Z | self=, 2025-05-07T20:32:54.1679018Z | T=1, # or any other generated value 2025-05-07T20:32:54.1679329Z | D=5120, # or any other generated value 2025-05-07T20:32:54.1679737Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.1680095Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.1680495Z | compiled=True, # or any other generated value 2025-05-07T20:32:54.1680796Z | ) 2025-05-07T20:32:54.1680975Z | 2025-05-07T20:32:54.1681490Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1682123Z +------------------------------------ 2025-05-07T20:32:54.1682631Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:54.1683214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1683789Z self=, 2025-05-07T20:32:54.1684340Z T=1, 2025-05-07T20:32:54.1684602Z D=5120, 2025-05-07T20:32:54.1684864Z scale_ub=None, 2025-05-07T20:32:54.1685166Z contiguous=True, 2025-05-07T20:32:54.1685489Z compiled=True, 2025-05-07T20:32:54.1685779Z ) 2025-05-07T20:32:54.1686222Z self = 2025-05-07T20:32:54.1686940Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1687300Z 2025-05-07T20:32:54.1687413Z @given( 2025-05-07T20:32:54.1687736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1688174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1688596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1689064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1689529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1689933Z ) 2025-05-07T20:32:54.1690416Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1691032Z def test_silu_mul_quant( 2025-05-07T20:32:54.1691379Z self, 2025-05-07T20:32:54.1810281Z T: int, 2025-05-07T20:32:54.1810597Z D: int, 2025-05-07T20:32:54.1810901Z scale_ub: Optional[float], 2025-05-07T20:32:54.1811306Z contiguous: bool, 2025-05-07T20:32:54.1811657Z compiled: bool, 2025-05-07T20:32:54.1811971Z ) -> None: 2025-05-07T20:32:54.1812251Z torch.manual_seed(2025) 2025-05-07T20:32:54.1812569Z 2025-05-07T20:32:54.1812931Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1813461Z 2025-05-07T20:32:54.1813714Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1814110Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1814518Z x = x_sign * x_clamp 2025-05-07T20:32:54.1814839Z x0 = x[:, :D] 2025-05-07T20:32:54.1815133Z x1 = x[:, D:] 2025-05-07T20:32:54.1815414Z 2025-05-07T20:32:54.1815658Z if contiguous: 2025-05-07T20:32:54.1815969Z x0 = x0.contiguous() 2025-05-07T20:32:54.1816316Z x1 = x1.contiguous() 2025-05-07T20:32:54.1816635Z 2025-05-07T20:32:54.1816892Z if scale_ub is not None: 2025-05-07T20:32:54.1817260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1817707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1818116Z ) 2025-05-07T20:32:54.1818377Z else: 2025-05-07T20:32:54.1818652Z scale_ub_tensor = None 2025-05-07T20:32:54.1818996Z 2025-05-07T20:32:54.1819322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1819762Z op = silu_mul_quant 2025-05-07T20:32:54.1820119Z if compiled: 2025-05-07T20:32:54.1820464Z op = torch.compile(op) 2025-05-07T20:32:54.1820867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1821258Z 2025-05-07T20:32:54.1821522Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1821916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1822296Z 2025-05-07T20:32:54.1822912Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1823262Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1823682Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1824000Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1824357Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1824659Z 2025-05-07T20:32:54.1824860Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1825052Z 2025-05-07T20:32:54.1825159Z moe/activation_test.py:126: 2025-05-07T20:32:54.1825544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1825879Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1826205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1826995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1827739Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1828368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1829048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1829731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1830443Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1831177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1831810Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1832399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1832912Z fn() 2025-05-07T20:32:54.1833459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1834046Z self.fn.run( 2025-05-07T20:32:54.1834510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1835043Z kernel = self.compile( 2025-05-07T20:32:54.1835582Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1836221Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1836629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1836863Z 2025-05-07T20:32:54.1837069Z self = 2025-05-07T20:32:54.1838144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1839542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc89114dc60>} 2025-05-07T20:32:54.1840864Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1841882Z context = 2025-05-07T20:32:54.1842176Z 2025-05-07T20:32:54.1842340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1842860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1843318Z module_map=module_map) 2025-05-07T20:32:54.1843693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1844099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1844359Z E ^ 2025-05-07T20:32:54.1844872Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1845320Z 2025-05-07T20:32:54.1845733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1846235Z 2025-05-07T20:32:54.1846343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1846792Z self=, 2025-05-07T20:32:54.1847185Z T=2048, 2025-05-07T20:32:54.1847373Z D=5120, 2025-05-07T20:32:54.1847560Z scale_ub=1200.0, 2025-05-07T20:32:54.1847784Z contiguous=True, 2025-05-07T20:32:54.1848002Z compiled=False, 2025-05-07T20:32:54.1848207Z ) 2025-05-07T20:32:54.1848521Z self = 2025-05-07T20:32:54.1849011Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.1849280Z 2025-05-07T20:32:54.1849409Z @given( 2025-05-07T20:32:54.1849638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1849948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1850253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1850573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1850903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1851197Z ) 2025-05-07T20:32:54.1851544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1851986Z def test_silu_mul_quant( 2025-05-07T20:32:54.1852229Z self, 2025-05-07T20:32:54.1852417Z T: int, 2025-05-07T20:32:54.1852614Z D: int, 2025-05-07T20:32:54.1852838Z scale_ub: Optional[float], 2025-05-07T20:32:54.1853240Z contiguous: bool, 2025-05-07T20:32:54.1853475Z compiled: bool, 2025-05-07T20:32:54.1853698Z ) -> None: 2025-05-07T20:32:54.1853918Z torch.manual_seed(2025) 2025-05-07T20:32:54.1854159Z 2025-05-07T20:32:54.1854435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1854777Z 2025-05-07T20:32:54.1854970Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1855264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1855575Z x = x_sign * x_clamp 2025-05-07T20:32:54.1855811Z x0 = x[:, :D] 
2025-05-07T20:32:54.1856036Z x1 = x[:, D:] 2025-05-07T20:32:54.1856248Z 2025-05-07T20:32:54.1856433Z if contiguous: 2025-05-07T20:32:54.1856669Z x0 = x0.contiguous() 2025-05-07T20:32:54.1856932Z x1 = x1.contiguous() 2025-05-07T20:32:54.1857170Z 2025-05-07T20:32:54.1857368Z if scale_ub is not None: 2025-05-07T20:32:54.1857648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1857983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1858295Z ) 2025-05-07T20:32:54.1858493Z else: 2025-05-07T20:32:54.1858709Z scale_ub_tensor = None 2025-05-07T20:32:54.1858959Z 2025-05-07T20:32:54.1859487Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1859853Z op = silu_mul_quant 2025-05-07T20:32:54.1860098Z if compiled: 2025-05-07T20:32:54.1860348Z op = torch.compile(op) 2025-05-07T20:32:54.1860647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1860921Z 2025-05-07T20:32:54.1861120Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.1861283Z 2025-05-07T20:32:54.1861389Z moe/activation_test.py:117: 2025-05-07T20:32:54.1861678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1862011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.1862405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1863202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.1863878Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.1864413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1865085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1865738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1866329Z kernel = self.compile( 2025-05-07T20:32:54.1866867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1867509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1867895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1868134Z 2025-05-07T20:32:54.1868341Z self = 2025-05-07T20:32:54.1869456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1870808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db0220>} 2025-05-07T20:32:54.1872120Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1873142Z context = 2025-05-07T20:32:54.1873437Z 2025-05-07T20:32:54.1873602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1874120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1874581Z module_map=module_map) 2025-05-07T20:32:54.1891963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1892437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.1892696Z E ^ 2025-05-07T20:32:54.1893250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1893710Z 2025-05-07T20:32:54.1894134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1894641Z 2025-05-07T20:32:54.1894755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1895162Z self=, 2025-05-07T20:32:54.1895567Z T=2048, 2025-05-07T20:32:54.1895758Z D=5120, 2025-05-07T20:32:54.1895946Z scale_ub=1200.0, 2025-05-07T20:32:54.1896181Z contiguous=True, 2025-05-07T20:32:54.1896409Z compiled=True, 2025-05-07T20:32:54.1896616Z ) 2025-05-07T20:32:54.1896928Z self = 2025-05-07T20:32:54.1897413Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.1897680Z 2025-05-07T20:32:54.1897763Z @given( 2025-05-07T20:32:54.1897988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1898307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1898614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1898940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1899267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1899557Z ) 2025-05-07T20:32:54.1899904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1900435Z def test_silu_mul_quant( 2025-05-07T20:32:54.1900719Z self, 2025-05-07T20:32:54.1900916Z T: int, 2025-05-07T20:32:54.1901113Z D: int, 2025-05-07T20:32:54.1901335Z scale_ub: Optional[float], 2025-05-07T20:32:54.1901609Z contiguous: bool, 2025-05-07T20:32:54.1901842Z compiled: bool, 2025-05-07T20:32:54.1902071Z ) -> None: 2025-05-07T20:32:54.1902285Z torch.manual_seed(2025) 2025-05-07T20:32:54.1902521Z 2025-05-07T20:32:54.1902793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1903183Z 2025-05-07T20:32:54.1903371Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1903659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1903968Z x = x_sign * x_clamp 2025-05-07T20:32:54.1904203Z x0 = x[:, :D] 2025-05-07T20:32:54.1904424Z x1 = x[:, D:] 2025-05-07T20:32:54.1904643Z 2025-05-07T20:32:54.1904826Z if contiguous: 2025-05-07T20:32:54.1905057Z x0 = x0.contiguous() 2025-05-07T20:32:54.1905359Z x1 = x1.contiguous() 2025-05-07T20:32:54.1905609Z 2025-05-07T20:32:54.1905797Z if scale_ub is not None: 2025-05-07T20:32:54.1906072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1906411Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1906715Z ) 2025-05-07T20:32:54.1906903Z else: 2025-05-07T20:32:54.1907111Z scale_ub_tensor = None 2025-05-07T20:32:54.1907351Z 2025-05-07T20:32:54.1907573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1907879Z op = silu_mul_quant 2025-05-07T20:32:54.1908119Z if compiled: 2025-05-07T20:32:54.1908373Z op = torch.compile(op) 2025-05-07T20:32:54.1908662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1908925Z 2025-05-07T20:32:54.1909116Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1909404Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1909680Z 2025-05-07T20:32:54.1909923Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1910256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1910542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1910846Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1911194Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1911507Z 2025-05-07T20:32:54.1911700Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1911893Z 2025-05-07T20:32:54.1911995Z moe/activation_test.py:126: 2025-05-07T20:32:54.1912291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1912615Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1912939Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1913721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1914462Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1914998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1915668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1916341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1917047Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1917756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1918386Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1919027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1919575Z fn() 2025-05-07T20:32:54.1920072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1920653Z self.fn.run( 2025-05-07T20:32:54.1921110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1921621Z kernel = self.compile( 2025-05-07T20:32:54.1922146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1922826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1923211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1923440Z 2025-05-07T20:32:54.1923641Z self = 2025-05-07T20:32:54.1924749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1926099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db16c0>} 2025-05-07T20:32:54.1927418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1928419Z context = 2025-05-07T20:32:54.1928714Z 2025-05-07T20:32:54.1928876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1929388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1929853Z module_map=module_map) 2025-05-07T20:32:54.1930212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1930565Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1930831Z E ^ 2025-05-07T20:32:54.1931280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    [test source identical to the previous example]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
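Because the test runs with verbosity=Verbosity.verbose and deadline=None, Hypothesis keeps drawing new parameter combinations after each failure, and the identical CompilationError is reprinted for every draw below. A skip guard at the test level would collapse this output to a single skipped test; a hedged sketch using plain unittest (the class name and helper are illustrative, not FBGEMM's actual test scaffolding):

# Hedged sketch: skip fp8 tests on GPUs that cannot compile fp8e4nv, instead
# of letting each Hypothesis example fail with the same CompilationError.
# `Fp8ActivationTests` and `supports_fp8e4nv` are illustrative names.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv needs compute capability >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # the Hypothesis-driven body shown in the log would go here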
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1962755Z 2025-05-07T20:32:54.1963169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1963670Z 2025-05-07T20:32:54.1963776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1964252Z self=, 2025-05-07T20:32:54.1964658Z T=1, 2025-05-07T20:32:54.1964843Z D=7168, 2025-05-07T20:32:54.1965042Z scale_ub=None, 2025-05-07T20:32:54.1965259Z contiguous=True, 2025-05-07T20:32:54.1965471Z compiled=True, 2025-05-07T20:32:54.1965689Z ) 2025-05-07T20:32:54.1966010Z self = 2025-05-07T20:32:54.1966486Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1966749Z 2025-05-07T20:32:54.1966827Z @given( 2025-05-07T20:32:54.1967065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1967378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1967686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1968014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1968345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1968626Z ) 2025-05-07T20:32:54.1968977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1969416Z def test_silu_mul_quant( 2025-05-07T20:32:54.1969652Z self, 2025-05-07T20:32:54.1969855Z T: int, 2025-05-07T20:32:54.1970057Z D: int, 2025-05-07T20:32:54.1970272Z scale_ub: Optional[float], 2025-05-07T20:32:54.1970551Z contiguous: bool, 2025-05-07T20:32:54.1970796Z compiled: bool, 2025-05-07T20:32:54.1971014Z ) -> None: 2025-05-07T20:32:54.1971234Z torch.manual_seed(2025) 2025-05-07T20:32:54.1971482Z 2025-05-07T20:32:54.1971749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1972096Z 2025-05-07T20:32:54.1972294Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1972585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1972892Z x = x_sign * x_clamp 2025-05-07T20:32:54.1973198Z x0 = x[:, :D] 2025-05-07T20:32:54.1973423Z x1 = x[:, D:] 2025-05-07T20:32:54.1973631Z 2025-05-07T20:32:54.1973825Z if contiguous: 2025-05-07T20:32:54.1974061Z x0 = x0.contiguous() 2025-05-07T20:32:54.1974317Z x1 = x1.contiguous() 2025-05-07T20:32:54.1974563Z 2025-05-07T20:32:54.1974768Z if scale_ub is not None: 2025-05-07T20:32:54.1975040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1975382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1975700Z ) 2025-05-07T20:32:54.1975895Z else: 2025-05-07T20:32:54.1976121Z scale_ub_tensor = None 2025-05-07T20:32:54.1976376Z 2025-05-07T20:32:54.1976601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1976918Z op = silu_mul_quant 2025-05-07T20:32:54.1977167Z if compiled: 2025-05-07T20:32:54.1977500Z op = torch.compile(op) 2025-05-07T20:32:54.1977785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1978123Z 2025-05-07T20:32:54.1978317Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1978598Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1978886Z 2025-05-07T20:32:54.1979124Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1979447Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1979738Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1980097Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1980450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1980762Z 2025-05-07T20:32:54.1980965Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.1981155Z 2025-05-07T20:32:54.1981260Z moe/activation_test.py:126: 2025-05-07T20:32:54.1981546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1981881Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1982247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1983069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1983815Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1984355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1985024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1985697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1986408Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1987123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1987749Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1988337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1988842Z fn() 2025-05-07T20:32:54.1989343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1989908Z self.fn.run( 2025-05-07T20:32:54.1990368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1990889Z kernel = self.compile( 2025-05-07T20:32:54.1991417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1992055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1992449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1992672Z 2025-05-07T20:32:54.1992884Z self = 2025-05-07T20:32:54.1993936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1995282Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28e00>} 2025-05-07T20:32:54.1996599Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1997602Z context = 2025-05-07T20:32:54.1997932Z 2025-05-07T20:32:54.1998101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1998648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1999115Z module_map=module_map) 2025-05-07T20:32:54.1999478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1999826Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2000088Z E ^ 2025-05-07T20:32:54.2000545Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2001029Z 2025-05-07T20:32:54.2001440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2001940Z 2025-05-07T20:32:54.2002043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2002448Z self=, 2025-05-07T20:32:54.2002841Z T=4096, 2025-05-07T20:32:54.2003019Z D=5120, 2025-05-07T20:32:54.2003213Z scale_ub=None, 2025-05-07T20:32:54.2003472Z contiguous=False, 2025-05-07T20:32:54.2003737Z compiled=False, 2025-05-07T20:32:54.2003944Z ) 2025-05-07T20:32:54.2004259Z self = 2025-05-07T20:32:54.2004747Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2005013Z 2025-05-07T20:32:54.2005088Z @given( 2025-05-07T20:32:54.2005322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2005630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2005924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2006249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2006578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2006853Z ) 2025-05-07T20:32:54.2007198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2007635Z def test_silu_mul_quant( 2025-05-07T20:32:54.2007878Z self, 2025-05-07T20:32:54.2008070Z T: int, 2025-05-07T20:32:54.2008268Z D: int, 2025-05-07T20:32:54.2008486Z scale_ub: Optional[float], 2025-05-07T20:32:54.2008746Z contiguous: bool, 2025-05-07T20:32:54.2008982Z compiled: bool, 2025-05-07T20:32:54.2009203Z ) -> None: 2025-05-07T20:32:54.2009410Z torch.manual_seed(2025) 2025-05-07T20:32:54.2009648Z 2025-05-07T20:32:54.2009919Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2010250Z 2025-05-07T20:32:54.2010451Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2010736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2011032Z x = x_sign * x_clamp 2025-05-07T20:32:54.2011274Z x0 = x[:, :D] 2025-05-07T20:32:54.2011490Z x1 = x[:, D:] 2025-05-07T20:32:54.2011696Z 2025-05-07T20:32:54.2011885Z if contiguous: 2025-05-07T20:32:54.2012119Z x0 = x0.contiguous() 2025-05-07T20:32:54.2012378Z x1 = x1.contiguous() 2025-05-07T20:32:54.2012608Z 2025-05-07T20:32:54.2012804Z if scale_ub is not None: 2025-05-07T20:32:54.2013179Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2013503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2013805Z ) 2025-05-07T20:32:54.2013997Z else: 2025-05-07T20:32:54.2014203Z scale_ub_tensor = None 2025-05-07T20:32:54.2014450Z 2025-05-07T20:32:54.2014681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2014989Z op = silu_mul_quant 2025-05-07T20:32:54.2015246Z if compiled: 2025-05-07T20:32:54.2015492Z op = torch.compile(op) 2025-05-07T20:32:54.2015780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2016102Z 2025-05-07T20:32:54.2016298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2016458Z 2025-05-07T20:32:54.2016604Z moe/activation_test.py:117: 2025-05-07T20:32:54.2016897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2017229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2017508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2018180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2018907Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2019438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2020111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2020762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2021289Z kernel = self.compile( 2025-05-07T20:32:54.2021713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2021898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2022023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2022027Z 2025-05-07T20:32:54.2022235Z self = 2025-05-07T20:32:54.2022998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2023499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890f7f240>} 2025-05-07T20:32:54.2024240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2024430Z context = 2025-05-07T20:32:54.2024435Z 2025-05-07T20:32:54.2024609Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2024867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2024975Z module_map=module_map) 2025-05-07T20:32:54.2025142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2025240Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2025323Z E ^ 2025-05-07T20:32:54.2025675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2025682Z 2025-05-07T20:32:54.2026091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2026098Z 2025-05-07T20:32:54.2026205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2026423Z self=, 2025-05-07T20:32:54.2026497Z T=4096, 2025-05-07T20:32:54.2026579Z D=7168, 2025-05-07T20:32:54.2026658Z scale_ub=None, 2025-05-07T20:32:54.2026747Z contiguous=False, 2025-05-07T20:32:54.2026831Z compiled=False, 2025-05-07T20:32:54.2026905Z ) 2025-05-07T20:32:54.2027123Z self = 2025-05-07T20:32:54.2027291Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2027295Z 2025-05-07T20:32:54.2027369Z @given( 2025-05-07T20:32:54.2027495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2027638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2027752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2027945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2028059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2028137Z ) 2025-05-07T20:32:54.2028378Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2028469Z def test_silu_mul_quant( 2025-05-07T20:32:54.2028548Z self, 2025-05-07T20:32:54.2028624Z T: int, 2025-05-07T20:32:54.2028743Z D: int, 2025-05-07T20:32:54.2028849Z scale_ub: Optional[float], 2025-05-07T20:32:54.2028939Z contiguous: bool, 2025-05-07T20:32:54.2029023Z compiled: bool, 2025-05-07T20:32:54.2029103Z ) -> None: 2025-05-07T20:32:54.2029196Z torch.manual_seed(2025) 2025-05-07T20:32:54.2029266Z 2025-05-07T20:32:54.2029438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2029514Z 2025-05-07T20:32:54.2029610Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2029775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2029865Z x = x_sign * x_clamp 2025-05-07T20:32:54.2029951Z x0 = x[:, :D] 2025-05-07T20:32:54.2030028Z x1 = x[:, D:] 2025-05-07T20:32:54.2030100Z 2025-05-07T20:32:54.2030186Z if contiguous: 2025-05-07T20:32:54.2030278Z x0 = x0.contiguous() 2025-05-07T20:32:54.2030365Z x1 = x1.contiguous() 2025-05-07T20:32:54.2030448Z 2025-05-07T20:32:54.2030539Z if scale_ub is not None: 2025-05-07T20:32:54.2030643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2030783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2030860Z ) 2025-05-07T20:32:54.2030936Z else: 2025-05-07T20:32:54.2031036Z scale_ub_tensor = None 2025-05-07T20:32:54.2031116Z 2025-05-07T20:32:54.2031250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2031342Z op = silu_mul_quant 2025-05-07T20:32:54.2031430Z if compiled: 2025-05-07T20:32:54.2031535Z op = torch.compile(op) 2025-05-07T20:32:54.2031639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2031712Z 2025-05-07T20:32:54.2044508Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2044517Z 2025-05-07T20:32:54.2044639Z moe/activation_test.py:117: 2025-05-07T20:32:54.2044776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2044890Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2044997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2045507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2045614Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2045975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2046210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2046545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2046638Z kernel = self.compile( 2025-05-07T20:32:54.2047027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2047201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2047339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2047344Z 2025-05-07T20:32:54.2047551Z self = 2025-05-07T20:32:54.2048319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2048959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b181440>} 2025-05-07T20:32:54.2049697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2049937Z context = 2025-05-07T20:32:54.2049942Z 2025-05-07T20:32:54.2050106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2050370Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2050485Z module_map=module_map) 2025-05-07T20:32:54.2050655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2050762Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2050882Z E ^ 2025-05-07T20:32:54.2051238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2051243Z 2025-05-07T20:32:54.2051662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2051667Z 2025-05-07T20:32:54.2051770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2052000Z self=, 2025-05-07T20:32:54.2052079Z T=128, 2025-05-07T20:32:54.2052157Z D=7168, 2025-05-07T20:32:54.2052247Z scale_ub=None, 2025-05-07T20:32:54.2052336Z contiguous=False, 2025-05-07T20:32:54.2052419Z compiled=True, 2025-05-07T20:32:54.2052506Z ) 2025-05-07T20:32:54.2052725Z self = 2025-05-07T20:32:54.2052900Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2052908Z 2025-05-07T20:32:54.2053099Z @given( 2025-05-07T20:32:54.2053222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2053332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2053446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2053563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2053685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2053767Z ) 2025-05-07T20:32:54.2054013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2054112Z def test_silu_mul_quant( 2025-05-07T20:32:54.2054191Z self, 2025-05-07T20:32:54.2054270Z T: int, 2025-05-07T20:32:54.2054355Z D: int, 2025-05-07T20:32:54.2054454Z scale_ub: Optional[float], 2025-05-07T20:32:54.2054548Z contiguous: bool, 2025-05-07T20:32:54.2054640Z compiled: bool, 2025-05-07T20:32:54.2054723Z ) -> None: 2025-05-07T20:32:54.2054826Z torch.manual_seed(2025) 2025-05-07T20:32:54.2054902Z 2025-05-07T20:32:54.2055073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2055155Z 2025-05-07T20:32:54.2055245Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2055368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2055464Z x = x_sign * x_clamp 2025-05-07T20:32:54.2055549Z x0 = x[:, :D] 2025-05-07T20:32:54.2055628Z x1 = x[:, D:] 2025-05-07T20:32:54.2055708Z 2025-05-07T20:32:54.2055791Z if contiguous: 2025-05-07T20:32:54.2055883Z x0 = x0.contiguous() 2025-05-07T20:32:54.2055984Z x1 = x1.contiguous() 2025-05-07T20:32:54.2056058Z 2025-05-07T20:32:54.2056150Z if scale_ub is not None: 2025-05-07T20:32:54.2056312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2056446Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2056570Z ) 2025-05-07T20:32:54.2056648Z else: 2025-05-07T20:32:54.2056739Z scale_ub_tensor = None 2025-05-07T20:32:54.2056819Z 2025-05-07T20:32:54.2056947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2057035Z op = silu_mul_quant 2025-05-07T20:32:54.2057124Z if compiled: 2025-05-07T20:32:54.2057225Z op = torch.compile(op) 2025-05-07T20:32:54.2057371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2057451Z 2025-05-07T20:32:54.2057541Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2057661Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2057739Z 2025-05-07T20:32:54.2057874Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2057980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2058083Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2058207Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2058406Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2058481Z 2025-05-07T20:32:54.2058584Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.2058589Z 2025-05-07T20:32:54.2058696Z moe/activation_test.py:126: 2025-05-07T20:32:54.2058823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2058938Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.2059074Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2059988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.2060107Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.2060471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2060699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2061069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.2061324Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.2061703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.2061872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.2062210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.2062298Z fn() 2025-05-07T20:32:54.2062695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.2062792Z self.fn.run( 2025-05-07T20:32:54.2063140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2063234Z kernel = self.compile( 2025-05-07T20:32:54.2063618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2063793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2063921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2063929Z 2025-05-07T20:32:54.2064145Z self = 2025-05-07T20:32:54.2064918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2065658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af64540>} 2025-05-07T20:32:54.2066396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2066596Z context = 2025-05-07T20:32:54.2066601Z 2025-05-07T20:32:54.2066766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2067093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2067211Z module_map=module_map) 2025-05-07T20:32:54.2067375Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2067481Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2067571Z E ^ 2025-05-07T20:32:54.2067986Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2067991Z 2025-05-07T20:32:54.2068408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2068412Z 2025-05-07T20:32:54.2068513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2068735Z self=, 2025-05-07T20:32:54.2068824Z T=128, 2025-05-07T20:32:54.2068904Z D=7168, 2025-05-07T20:32:54.2068987Z scale_ub=None, 2025-05-07T20:32:54.2069085Z contiguous=False, 2025-05-07T20:32:54.2069169Z compiled=False, 2025-05-07T20:32:54.2069253Z ) 2025-05-07T20:32:54.2069469Z self = 2025-05-07T20:32:54.2069638Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2069645Z 2025-05-07T20:32:54.2069729Z @given( 2025-05-07T20:32:54.2069850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2069955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2070079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2070195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2070307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2070391Z ) 2025-05-07T20:32:54.2070633Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2070736Z def test_silu_mul_quant( 2025-05-07T20:32:54.2070815Z self, 2025-05-07T20:32:54.2070896Z T: int, 2025-05-07T20:32:54.2070978Z D: int, 2025-05-07T20:32:54.2071076Z scale_ub: Optional[float], 2025-05-07T20:32:54.2071165Z contiguous: bool, 2025-05-07T20:32:54.2071256Z compiled: bool, 2025-05-07T20:32:54.2071338Z ) -> None: 2025-05-07T20:32:54.2071430Z torch.manual_seed(2025) 2025-05-07T20:32:54.2071508Z 2025-05-07T20:32:54.2071682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2071758Z 2025-05-07T20:32:54.2071858Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2071982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2072077Z x = x_sign * x_clamp 2025-05-07T20:32:54.2072160Z x0 = x[:, :D] 2025-05-07T20:32:54.2072239Z x1 = x[:, D:] 2025-05-07T20:32:54.2072315Z 2025-05-07T20:32:54.2072401Z if contiguous: 2025-05-07T20:32:54.2072492Z x0 = x0.contiguous() 2025-05-07T20:32:54.2072587Z x1 = x1.contiguous() 2025-05-07T20:32:54.2072664Z 2025-05-07T20:32:54.2072755Z if scale_ub is not None: 2025-05-07T20:32:54.2072869Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2073002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2073153Z ) 2025-05-07T20:32:54.2073237Z else: 2025-05-07T20:32:54.2073331Z scale_ub_tensor = None 2025-05-07T20:32:54.2073447Z 2025-05-07T20:32:54.2073590Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2073680Z op = silu_mul_quant 2025-05-07T20:32:54.2073770Z if compiled: 2025-05-07T20:32:54.2073872Z op = torch.compile(op) 2025-05-07T20:32:54.2073981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2074057Z 2025-05-07T20:32:54.2074149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2074194Z 2025-05-07T20:32:54.2074297Z moe/activation_test.py:117: 2025-05-07T20:32:54.2074436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2074540Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2074641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2075142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2075244Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2075645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2075869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2076203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2076304Z kernel = self.compile( 2025-05-07T20:32:54.2076685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2076872Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2076999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2077004Z 2025-05-07T20:32:54.2077212Z self = 2025-05-07T20:32:54.2077997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2078497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af66700>} 2025-05-07T20:32:54.2079237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2079437Z context = 2025-05-07T20:32:54.2079442Z 2025-05-07T20:32:54.2079604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2079874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2079988Z module_map=module_map) 2025-05-07T20:32:54.2080159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2080261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2080341Z E ^ 2025-05-07T20:32:54.2080700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2080705Z 2025-05-07T20:32:54.2081113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2081121Z 2025-05-07T20:32:54.2081228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2081453Z self=, 2025-05-07T20:32:54.2081533Z T=4096, 2025-05-07T20:32:54.2081620Z D=5120, 2025-05-07T20:32:54.2081753Z scale_ub=1200.0, 2025-05-07T20:32:54.2081839Z contiguous=True, 2025-05-07T20:32:54.2081930Z compiled=False, 2025-05-07T20:32:54.2082008Z ) 2025-05-07T20:32:54.2082269Z self = 2025-05-07T20:32:54.2082455Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2082460Z 2025-05-07T20:32:54.2082537Z @given( 2025-05-07T20:32:54.2082672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2082795Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2082981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2083107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2083221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2083301Z ) 2025-05-07T20:32:54.2083551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2083643Z def test_silu_mul_quant( 2025-05-07T20:32:54.2083723Z self, 2025-05-07T20:32:54.2083809Z T: int, 2025-05-07T20:32:54.2083886Z D: int, 2025-05-07T20:32:54.2084039Z scale_ub: Optional[float], 2025-05-07T20:32:54.2084140Z contiguous: bool, 2025-05-07T20:32:54.2084227Z compiled: bool, 2025-05-07T20:32:54.2084311Z ) -> None: 2025-05-07T20:32:54.2084405Z torch.manual_seed(2025) 2025-05-07T20:32:54.2084482Z 2025-05-07T20:32:54.2084660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2084733Z 2025-05-07T20:32:54.2084827Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2084959Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2085050Z x = x_sign * x_clamp 2025-05-07T20:32:54.2085132Z x0 = x[:, :D] 2025-05-07T20:32:54.2085219Z x1 = x[:, D:] 2025-05-07T20:32:54.2085293Z 2025-05-07T20:32:54.2085379Z if contiguous: 2025-05-07T20:32:54.2085479Z x0 = x0.contiguous() 2025-05-07T20:32:54.2085569Z x1 = x1.contiguous() 2025-05-07T20:32:54.2085652Z 2025-05-07T20:32:54.2085745Z if scale_ub is not None: 2025-05-07T20:32:54.2085855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2085993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2086070Z ) 2025-05-07T20:32:54.2086149Z else: 2025-05-07T20:32:54.2086247Z scale_ub_tensor = None 2025-05-07T20:32:54.2086321Z 2025-05-07T20:32:54.2086449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2086550Z op = silu_mul_quant 2025-05-07T20:32:54.2086632Z if compiled: 2025-05-07T20:32:54.2086731Z op = torch.compile(op) 2025-05-07T20:32:54.2086840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2086914Z 2025-05-07T20:32:54.2087011Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2087016Z 2025-05-07T20:32:54.2087113Z moe/activation_test.py:117: 2025-05-07T20:32:54.2087240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2087350Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2087451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2087944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2088052Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2088404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2088632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2088968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2089065Z kernel = self.compile( 2025-05-07T20:32:54.2089451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2089714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2089846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2089858Z 2025-05-07T20:32:54.2090063Z self = 2025-05-07T20:32:54.2090829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2091374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af676a0>} 2025-05-07T20:32:54.2092108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2092345Z context = 2025-05-07T20:32:54.2092350Z 2025-05-07T20:32:54.2092516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2092777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2092892Z module_map=module_map) 2025-05-07T20:32:54.2093162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2093281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2093370Z E ^ 2025-05-07T20:32:54.2093723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2093728Z 2025-05-07T20:32:54.2094145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2094152Z 2025-05-07T20:32:54.2094255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2094482Z self=, 2025-05-07T20:32:54.2094573Z T=1, 2025-05-07T20:32:54.2094652Z D=5120, 2025-05-07T20:32:54.2094741Z scale_ub=None, 2025-05-07T20:32:54.2094826Z contiguous=True, 2025-05-07T20:32:54.2094910Z compiled=True, 2025-05-07T20:32:54.2094994Z ) 2025-05-07T20:32:54.2095212Z self = 2025-05-07T20:32:54.2095376Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2095380Z 2025-05-07T20:32:54.2095466Z @given( 2025-05-07T20:32:54.2095585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2095685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2095808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2095931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2096057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2096133Z ) 2025-05-07T20:32:54.2096385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2096488Z def test_silu_mul_quant( 2025-05-07T20:32:54.2096567Z self, 2025-05-07T20:32:54.2096647Z T: int, 2025-05-07T20:32:54.2096732Z D: int, 2025-05-07T20:32:54.2096831Z scale_ub: Optional[float], 2025-05-07T20:32:54.2096921Z contiguous: bool, 2025-05-07T20:32:54.2097019Z compiled: bool, 2025-05-07T20:32:54.2097096Z ) -> None: 2025-05-07T20:32:54.2097204Z torch.manual_seed(2025) 2025-05-07T20:32:54.2097278Z 2025-05-07T20:32:54.2097445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2097521Z 2025-05-07T20:32:54.2097612Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2097786Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2097887Z x = x_sign * x_clamp 2025-05-07T20:32:54.2097969Z x0 = x[:, :D] 2025-05-07T20:32:54.2098091Z x1 = x[:, D:] 2025-05-07T20:32:54.2098174Z 2025-05-07T20:32:54.2098255Z if contiguous: 2025-05-07T20:32:54.2098351Z x0 = x0.contiguous() 2025-05-07T20:32:54.2098440Z x1 = x1.contiguous() 2025-05-07T20:32:54.2098511Z 2025-05-07T20:32:54.2098609Z if scale_ub is not None: 2025-05-07T20:32:54.2098715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2098890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2098973Z ) 2025-05-07T20:32:54.2099049Z else: 2025-05-07T20:32:54.2099140Z scale_ub_tensor = None 2025-05-07T20:32:54.2099218Z 2025-05-07T20:32:54.2099345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2099434Z op = silu_mul_quant 2025-05-07T20:32:54.2099527Z if compiled: 2025-05-07T20:32:54.2099625Z op = torch.compile(op) 2025-05-07T20:32:54.2099740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2099879Z 2025-05-07T20:32:54.2099970Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2100096Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2100169Z 2025-05-07T20:32:54.2100304Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2100413Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2100512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2100636Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2100778Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2100849Z 2025-05-07T20:32:54.2100952Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.2100957Z 2025-05-07T20:32:54.2101053Z moe/activation_test.py:126: 2025-05-07T20:32:54.2101181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2101289Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.2101432Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2101980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.2102086Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.2102437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2102666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2103052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.2103327Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.2103700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.2103866Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.2104205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.2104281Z fn() 2025-05-07T20:32:54.2104676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.2104763Z self.fn.run( 2025-05-07T20:32:54.2105098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2105190Z kernel = self.compile( 2025-05-07T20:32:54.2105569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2105741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2105923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2105927Z 2025-05-07T20:32:54.2106172Z self = 2025-05-07T20:32:54.2106939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2107440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd43880>} 2025-05-07T20:32:54.2108209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2108417Z context = 2025-05-07T20:32:54.2108424Z 2025-05-07T20:32:54.2108588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2108881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2108995Z module_map=module_map) 2025-05-07T20:32:54.2109155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2109256Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2109338Z E ^ 2025-05-07T20:32:54.2109686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2109693Z 2025-05-07T20:32:54.2110106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2110111Z 2025-05-07T20:32:54.2110212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2110435Z self=, 2025-05-07T20:32:54.2110517Z T=2048, 2025-05-07T20:32:54.2110593Z D=5120, 2025-05-07T20:32:54.2110679Z scale_ub=None, 2025-05-07T20:32:54.2110771Z contiguous=True, 2025-05-07T20:32:54.2110853Z compiled=True, 2025-05-07T20:32:54.2110934Z ) 2025-05-07T20:32:54.2111149Z self = 2025-05-07T20:32:54.2111315Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2111319Z 2025-05-07T20:32:54.2111406Z @given( 2025-05-07T20:32:54.2111522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2111622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2111744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2111857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2111971Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2112058Z ) 2025-05-07T20:32:54.2112300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2112406Z def test_silu_mul_quant( 2025-05-07T20:32:54.2112483Z self, 2025-05-07T20:32:54.2113156Z T: int, 2025-05-07T20:32:54.2113239Z D: int, 2025-05-07T20:32:54.2113339Z scale_ub: Optional[float], 2025-05-07T20:32:54.2113426Z contiguous: bool, 2025-05-07T20:32:54.2113515Z compiled: bool, 2025-05-07T20:32:54.2113592Z ) -> None: 2025-05-07T20:32:54.2113685Z torch.manual_seed(2025) 2025-05-07T20:32:54.2113765Z 2025-05-07T20:32:54.2113932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2114001Z 2025-05-07T20:32:54.2114097Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2114220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2114315Z x = x_sign * x_clamp 2025-05-07T20:32:54.2114393Z x0 = x[:, :D] 2025-05-07T20:32:54.2114523Z x1 = x[:, D:] 2025-05-07T20:32:54.2114601Z 2025-05-07T20:32:54.2114682Z if contiguous: 2025-05-07T20:32:54.2114816Z x0 = x0.contiguous() 2025-05-07T20:32:54.2114916Z x1 = x1.contiguous() 2025-05-07T20:32:54.2114988Z 2025-05-07T20:32:54.2115078Z if scale_ub is not None: 2025-05-07T20:32:54.2115190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2115322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2115393Z ) 2025-05-07T20:32:54.2115474Z else: 2025-05-07T20:32:54.2115608Z scale_ub_tensor = None 2025-05-07T20:32:54.2115682Z 2025-05-07T20:32:54.2115810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2115900Z op = silu_mul_quant 2025-05-07T20:32:54.2115990Z if compiled: 2025-05-07T20:32:54.2116089Z op = torch.compile(op) 2025-05-07T20:32:54.2116194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2116278Z 2025-05-07T20:32:54.2116367Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2116528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2116604Z 2025-05-07T20:32:54.2116739Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2116840Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2116944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2117066Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2117213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2117289Z 2025-05-07T20:32:54.2117389Z > y_fp8_ref, 
        y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function …>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<…>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <…>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: identical ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as above, ending in the same CompilationError at triton/compiler/compiler.py:100.
Trying example: test_silu_mul_quant(self=<…>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)
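The path that keeps failing here is row-wise fp8 quantization. Below is a minimal pure-PyTorch sketch of that computation, consistent with the reconstruction the test itself uses (y = y_fp8.to(torch.float32) * y_scale[:, None]); the function name and the exact scale_ub/clamp handling are illustrative assumptions, not FBGEMM's actual kernel.

# Hedged sketch of row-wise fp8 quantization; not FBGEMM's implementation.
from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Largest finite value representable in fp8 e4m3 (Triton's "fp8e4nv").
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # One scale per row, optionally capped by scale_ub as in the test above.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = (row_max / fp8_max).clamp(min=1e-12)  # per-row dequant scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Multiplying y_fp8 back by y_scale[:, None] recovers y up to fp8 rounding, which is exactly the round-trip the test checks.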
Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
self = <…>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

(same source listing; this time the failure is raised from fn() rather than ref_fn())

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
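Both kernels (_kernel_quantize_fp8_row and _fbgemm_silu_mul_quant) fail for the same reason: Triton's fp8e4nv is the e4m3 fp8 format, which its NVIDIA backend only emits on compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G at SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available. A hedged sketch of a capability guard (not present in the test above) that would skip instead of fail on such runners:

# Sketch of an architecture guard; the SM 8.9 threshold is an assumption
# based on the supported-dtype list in the error, not taken from the log.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) codegen needs compute capability >= 8.9.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage (illustrative):
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...): ...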
Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
(same source listing; fails at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError, without the torch/_dynamo/eval_frame.py frame since compiled=False)
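Hypothesis keeps drawing new parameter sets and rediscovering the same failure. For local debugging it can help to pin one failing example so it replays deterministically before any random draws; @example is standard Hypothesis API, though this decorator is not part of the test shown above.

# Sketch: pinning a failing Hypothesis example for deterministic replay.
from hypothesis import example, given, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=1)  # replays the T=1 failure from this log first
def test_replay(T: int) -> None:
    assert T >= 1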
Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
(same source listing; fails at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)
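The failure is independent of T, D, scale_ub, contiguity, and torch.compile, so a tiny input reproduces it without Hypothesis. A hedged repro sketch; the module import path is inferred from the site-packages file path in the traceback, and is an assumption rather than documented API.

# Minimal repro sketch (assumes the same fbgemm_gpu experimental build).
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

y = torch.randn(2, 8, device="cuda", dtype=torch.float32)
# On a pre-SM89 GPU this raises the same CompilationError as every example above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)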
2025-05-07T20:32:54.2280032Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

(test body identical to the example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

(make_ir frame and locals as above)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
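The compiled=True variant changes nothing except the extra torch._dynamo eval_frame frame: the Triton kernel is still compiled lazily at first launch and fails the same way. A standalone repro sketch follows; it assumes a recent Triton that exposes tl.float8e4nv and accepts torch.float8_e4m3fn tensors as kernel arguments, and is not code from FBGEMM or this test suite.

    # Hypothetical minimal repro: materializing a tl.float8e4nv value makes
    # Triton raise the same ValueError at compile time on GPUs whose
    # architecture lacks fp8e4nv; on newer GPUs the kernel runs normally.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.float32)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError (wrapping the ValueError)
    # on pre-fp8 architectures.
    cast_to_fp8e4nv[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)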
2025-05-07T20:32:54.2293496Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
        -> same compiled-path CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.2311694Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

(test body identical to the example above; this time fn() itself succeeds and
the failure moves into the reference path)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

(make_ir frame and locals as above; options here use num_stages=2)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
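Since the reference path also dies inside a Triton kernel (_kernel_quantize_fp8_row, reached through the autotuner's benchmarking loop), even the oracle cannot run on this GPU. A possible workaround is to compute the reference row-wise quantization in eager PyTorch, which casts to torch.float8_e4m3fn in software and needs no fp8 hardware support. The sketch below is an assumption about triton_quantize_fp8_row's semantics (max-based row scales, optional scale upper bound, per-row dequantization scale returned), not FBGEMM's actual implementation.

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Row-wise fp8 quantization in eager PyTorch (sketch)."""
        row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Avoid division by zero for all-zero rows.
        row_max = torch.clamp(row_max, min=1e-12)
        x_fp8 = (x.to(torch.float32) * (FP8_MAX / row_max)).to(torch.float8_e4m3fn)
        # Return the dequantization scale, matching the test's
        # y_fp8.to(torch.float32) * y_scale[:, None] convention.
        return x_fp8, (row_max / FP8_MAX).squeeze(-1)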
2025-05-07T20:32:54.2327658Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2340996Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2354397Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2367898Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2381163Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2393785Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2406237Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2419241Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
(each of the above fails with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised while compiling _fbgemm_silu_mul_quant)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2434990Z 2025-05-07T20:32:54.2435394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2435444Z 2025-05-07T20:32:54.2435586Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2435812Z self=, 2025-05-07T20:32:54.2435885Z T=4096, 2025-05-07T20:32:54.2435963Z D=5120, 2025-05-07T20:32:54.2436049Z scale_ub=1200.0, 2025-05-07T20:32:54.2436130Z contiguous=False, 2025-05-07T20:32:54.2436211Z compiled=False, 2025-05-07T20:32:54.2436288Z ) 2025-05-07T20:32:54.2436500Z self = 2025-05-07T20:32:54.2436716Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2436721Z 2025-05-07T20:32:54.2436798Z @given( 2025-05-07T20:32:54.2436915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2437017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2437132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2437246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2437402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2437476Z ) 2025-05-07T20:32:54.2437716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2437813Z def test_silu_mul_quant( 2025-05-07T20:32:54.2437886Z self, 2025-05-07T20:32:54.2437963Z T: int, 2025-05-07T20:32:54.2438035Z D: int, 2025-05-07T20:32:54.2438130Z scale_ub: Optional[float], 2025-05-07T20:32:54.2438225Z contiguous: bool, 2025-05-07T20:32:54.2438310Z compiled: bool, 2025-05-07T20:32:54.2438383Z ) -> None: 2025-05-07T20:32:54.2438480Z torch.manual_seed(2025) 2025-05-07T20:32:54.2438548Z 2025-05-07T20:32:54.2438713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2438789Z 2025-05-07T20:32:54.2438879Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2439000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2439091Z x = x_sign * x_clamp 2025-05-07T20:32:54.2439172Z x0 = x[:, :D] 2025-05-07T20:32:54.2439252Z x1 = x[:, D:] 2025-05-07T20:32:54.2439322Z 2025-05-07T20:32:54.2439401Z if contiguous: 2025-05-07T20:32:54.2439492Z x0 = x0.contiguous() 2025-05-07T20:32:54.2439576Z x1 = x1.contiguous() 2025-05-07T20:32:54.2439647Z 2025-05-07T20:32:54.2439738Z if scale_ub is not None: 2025-05-07T20:32:54.2439844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2439973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2440047Z ) 2025-05-07T20:32:54.2440123Z else: 2025-05-07T20:32:54.2440216Z scale_ub_tensor = None 2025-05-07T20:32:54.2440294Z 2025-05-07T20:32:54.2440422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2440513Z op = silu_mul_quant 2025-05-07T20:32:54.2440599Z if compiled: 2025-05-07T20:32:54.2440697Z op = torch.compile(op) 2025-05-07T20:32:54.2440808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2440878Z 2025-05-07T20:32:54.2440966Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2440970Z 2025-05-07T20:32:54.2441070Z moe/activation_test.py:117: 2025-05-07T20:32:54.2441195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2441294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2441395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2441886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2441985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2442336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2442605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2442980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2443073Z kernel = self.compile( 2025-05-07T20:32:54.2443450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2443628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2443751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2443794Z 2025-05-07T20:32:54.2444000Z self = 2025-05-07T20:32:54.2444757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2445294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f6160>} 2025-05-07T20:32:54.2446031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2446221Z context = 2025-05-07T20:32:54.2446228Z 2025-05-07T20:32:54.2446395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2446650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2446759Z module_map=module_map) 2025-05-07T20:32:54.2446919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2447019Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2447098Z E ^ 2025-05-07T20:32:54.2447448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2447453Z 2025-05-07T20:32:54.2447856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2447861Z 2025-05-07T20:32:54.2447963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2448179Z self=, 2025-05-07T20:32:54.2448260Z T=4096, 2025-05-07T20:32:54.2448334Z D=5120, 2025-05-07T20:32:54.2448415Z scale_ub=1200.0, 2025-05-07T20:32:54.2448500Z contiguous=False, 2025-05-07T20:32:54.2448582Z compiled=True, 2025-05-07T20:32:54.2448655Z ) 2025-05-07T20:32:54.2448873Z self = 2025-05-07T20:32:54.2449044Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.2449049Z 2025-05-07T20:32:54.2449123Z @given( 2025-05-07T20:32:54.2449249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2449345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2449461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2449576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2449689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2449761Z ) 2025-05-07T20:32:54.2450004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2450092Z def test_silu_mul_quant( 2025-05-07T20:32:54.2450170Z self, 2025-05-07T20:32:54.2450247Z T: int, 2025-05-07T20:32:54.2450322Z D: int, 2025-05-07T20:32:54.2450423Z scale_ub: Optional[float], 2025-05-07T20:32:54.2450509Z contiguous: bool, 2025-05-07T20:32:54.2450634Z compiled: bool, 2025-05-07T20:32:54.2450715Z ) -> None: 2025-05-07T20:32:54.2450805Z torch.manual_seed(2025) 2025-05-07T20:32:54.2450942Z 2025-05-07T20:32:54.2451111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2451182Z 2025-05-07T20:32:54.2451273Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2451395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2451478Z x = x_sign * x_clamp 2025-05-07T20:32:54.2451557Z x0 = x[:, :D] 2025-05-07T20:32:54.2451635Z x1 = x[:, D:] 2025-05-07T20:32:54.2451745Z 2025-05-07T20:32:54.2451828Z if contiguous: 2025-05-07T20:32:54.2451917Z x0 = x0.contiguous() 2025-05-07T20:32:54.2452001Z x1 = x1.contiguous() 2025-05-07T20:32:54.2452074Z 2025-05-07T20:32:54.2452161Z if scale_ub is not None: 2025-05-07T20:32:54.2452267Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2452402Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2452475Z ) 2025-05-07T20:32:54.2452554Z else: 2025-05-07T20:32:54.2452690Z scale_ub_tensor = None 2025-05-07T20:32:54.2452765Z 2025-05-07T20:32:54.2452899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2453071Z op = silu_mul_quant 2025-05-07T20:32:54.2453171Z if compiled: 2025-05-07T20:32:54.2453271Z op = torch.compile(op) 2025-05-07T20:32:54.2453374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2453447Z 2025-05-07T20:32:54.2453540Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2453544Z 2025-05-07T20:32:54.2453638Z moe/activation_test.py:117: 2025-05-07T20:32:54.2453768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2453866Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2453963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2454327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2454423Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2454905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2455002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2455348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2455569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2455902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2455994Z kernel = self.compile( 2025-05-07T20:32:54.2456373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2456546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2456674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2456684Z 2025-05-07T20:32:54.2456888Z self = 2025-05-07T20:32:54.2457646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2458148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f7240>} 2025-05-07T20:32:54.2458876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2459113Z context = 2025-05-07T20:32:54.2459117Z 2025-05-07T20:32:54.2459713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2459998Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2460111Z module_map=module_map) 2025-05-07T20:32:54.2460272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2460378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2460527Z E ^ 2025-05-07T20:32:54.2460873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2460877Z 2025-05-07T20:32:54.2461284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2461292Z 2025-05-07T20:32:54.2461391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2461610Z self=, 2025-05-07T20:32:54.2461745Z T=2048, 2025-05-07T20:32:54.2461819Z D=7168, 2025-05-07T20:32:54.2461902Z scale_ub=1200.0, 2025-05-07T20:32:54.2461984Z contiguous=False, 2025-05-07T20:32:54.2462065Z compiled=False, 2025-05-07T20:32:54.2462143Z ) 2025-05-07T20:32:54.2462354Z self = 2025-05-07T20:32:54.2462526Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2462534Z 2025-05-07T20:32:54.2462612Z @given( 2025-05-07T20:32:54.2462728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2462825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2462941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2463056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2463198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2463277Z ) 2025-05-07T20:32:54.2463539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2463633Z def test_silu_mul_quant( 2025-05-07T20:32:54.2463705Z self, 2025-05-07T20:32:54.2463781Z T: int, 2025-05-07T20:32:54.2463860Z D: int, 2025-05-07T20:32:54.2463954Z scale_ub: Optional[float], 2025-05-07T20:32:54.2464042Z contiguous: bool, 2025-05-07T20:32:54.2464127Z compiled: bool, 2025-05-07T20:32:54.2464205Z ) -> None: 2025-05-07T20:32:54.2464298Z torch.manual_seed(2025) 2025-05-07T20:32:54.2464376Z 2025-05-07T20:32:54.2464541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2464618Z 2025-05-07T20:32:54.2464709Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2464830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2464920Z x = x_sign * x_clamp 2025-05-07T20:32:54.2464997Z x0 = x[:, :D] 2025-05-07T20:32:54.2465073Z x1 = x[:, D:] 2025-05-07T20:32:54.2465151Z 2025-05-07T20:32:54.2465237Z if contiguous: 2025-05-07T20:32:54.2465322Z x0 = x0.contiguous() 2025-05-07T20:32:54.2465411Z x1 = x1.contiguous() 2025-05-07T20:32:54.2465483Z 2025-05-07T20:32:54.2465570Z if scale_ub is not None: 2025-05-07T20:32:54.2465675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2465806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2465888Z ) 2025-05-07T20:32:54.2465963Z else: 2025-05-07T20:32:54.2466054Z scale_ub_tensor = None 2025-05-07T20:32:54.2466127Z 2025-05-07T20:32:54.2466250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2466336Z op = silu_mul_quant 2025-05-07T20:32:54.2466423Z if compiled: 2025-05-07T20:32:54.2466586Z op = torch.compile(op) 2025-05-07T20:32:54.2466687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2466763Z 2025-05-07T20:32:54.2466891Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2466896Z 2025-05-07T20:32:54.2466991Z moe/activation_test.py:117: 2025-05-07T20:32:54.2467121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2467217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2467316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2467804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2467940Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2468291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2468506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2468843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2468978Z kernel = self.compile( 2025-05-07T20:32:54.2469349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2469525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2469651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2469655Z 2025-05-07T20:32:54.2469860Z self = 2025-05-07T20:32:54.2470622Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2471116Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f230220>} 2025-05-07T20:32:54.2471861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2472048Z context = 2025-05-07T20:32:54.2472052Z 2025-05-07T20:32:54.2472217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2472474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2472582Z module_map=module_map) 2025-05-07T20:32:54.2472740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2472833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2472908Z E ^ 2025-05-07T20:32:54.2473261Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2473266Z 2025-05-07T20:32:54.2473675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2473680Z 2025-05-07T20:32:54.2473780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2473997Z self=, 2025-05-07T20:32:54.2474073Z T=1, 2025-05-07T20:32:54.2474149Z D=7168, 2025-05-07T20:32:54.2474232Z scale_ub=None, 2025-05-07T20:32:54.2474318Z contiguous=True, 2025-05-07T20:32:54.2474403Z compiled=False, 2025-05-07T20:32:54.2474474Z ) 2025-05-07T20:32:54.2474686Z self = 2025-05-07T20:32:54.2474851Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2474856Z 2025-05-07T20:32:54.2474980Z @given( 2025-05-07T20:32:54.2475099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2475233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2475347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2475462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2475574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2475648Z ) 2025-05-07T20:32:54.2475891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2475978Z def test_silu_mul_quant( 2025-05-07T20:32:54.2476098Z self, 2025-05-07T20:32:54.2476173Z T: int, 2025-05-07T20:32:54.2476247Z D: int, 2025-05-07T20:32:54.2476344Z scale_ub: Optional[float], 2025-05-07T20:32:54.2476432Z contiguous: bool, 2025-05-07T20:32:54.2476514Z compiled: bool, 2025-05-07T20:32:54.2476590Z ) -> None: 2025-05-07T20:32:54.2476682Z torch.manual_seed(2025) 2025-05-07T20:32:54.2476751Z 2025-05-07T20:32:54.2476921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2476991Z 2025-05-07T20:32:54.2477125Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2477249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2477337Z x = x_sign * x_clamp 2025-05-07T20:32:54.2477418Z x0 = x[:, :D] 2025-05-07T20:32:54.2477495Z x1 = x[:, D:] 2025-05-07T20:32:54.2477569Z 2025-05-07T20:32:54.2477653Z if contiguous: 2025-05-07T20:32:54.2477745Z x0 = x0.contiguous() 2025-05-07T20:32:54.2477828Z x1 = x1.contiguous() 2025-05-07T20:32:54.2477899Z 2025-05-07T20:32:54.2477987Z if scale_ub is not None: 2025-05-07T20:32:54.2478094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2478225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2478298Z ) 2025-05-07T20:32:54.2478379Z else: 2025-05-07T20:32:54.2478469Z scale_ub_tensor = None 2025-05-07T20:32:54.2478539Z 2025-05-07T20:32:54.2478673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2478759Z op = silu_mul_quant 2025-05-07T20:32:54.2478838Z if compiled: 2025-05-07T20:32:54.2478937Z op = torch.compile(op) 2025-05-07T20:32:54.2479039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2479110Z 2025-05-07T20:32:54.2479203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2479207Z 2025-05-07T20:32:54.2479301Z moe/activation_test.py:117: 2025-05-07T20:32:54.2479431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2479528Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2479623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2480113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2480210Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2480563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2480783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2481112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2481212Z kernel = self.compile( 2025-05-07T20:32:54.2481585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2481757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2481881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2481885Z 2025-05-07T20:32:54.2482087Z self = 2025-05-07T20:32:54.2482970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2483463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f231120>} 2025-05-07T20:32:54.2484189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2484418Z context = 2025-05-07T20:32:54.2484422Z 2025-05-07T20:32:54.2484583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2484838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2484943Z module_map=module_map) 2025-05-07T20:32:54.2485137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2485238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2485312Z E ^ 2025-05-07T20:32:54.2485659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2485663Z 2025-05-07T20:32:54.2486067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2486075Z 2025-05-07T20:32:54.2486173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2486395Z self=, 2025-05-07T20:32:54.2486469Z T=16384, 2025-05-07T20:32:54.2486543Z D=7168, 2025-05-07T20:32:54.2486623Z scale_ub=1200.0, 2025-05-07T20:32:54.2486710Z contiguous=False, 2025-05-07T20:32:54.2486791Z compiled=True, 2025-05-07T20:32:54.2486862Z ) 2025-05-07T20:32:54.2487080Z self = 2025-05-07T20:32:54.2487256Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.2487261Z 2025-05-07T20:32:54.2487332Z @given( 2025-05-07T20:32:54.2487447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2487556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2487667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2487784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2487897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2487970Z ) 2025-05-07T20:32:54.2488210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2488301Z def test_silu_mul_quant( 2025-05-07T20:32:54.2488373Z self, 2025-05-07T20:32:54.2488451Z T: int, 2025-05-07T20:32:54.2488527Z D: int, 2025-05-07T20:32:54.2488623Z scale_ub: Optional[float], 2025-05-07T20:32:54.2488722Z contiguous: bool, 2025-05-07T20:32:54.2488803Z compiled: bool, 2025-05-07T20:32:54.2488880Z ) -> None: 2025-05-07T20:32:54.2488973Z torch.manual_seed(2025) 2025-05-07T20:32:54.2489042Z 2025-05-07T20:32:54.2489205Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2489281Z 2025-05-07T20:32:54.2489368Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2489494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2489581Z x = x_sign * x_clamp 2025-05-07T20:32:54.2489658Z x0 = x[:, :D] 2025-05-07T20:32:54.2489733Z x1 = x[:, D:] 2025-05-07T20:32:54.2489804Z 2025-05-07T20:32:54.2489883Z if contiguous: 2025-05-07T20:32:54.2489976Z x0 = x0.contiguous() 2025-05-07T20:32:54.2490108Z x1 = x1.contiguous() 2025-05-07T20:32:54.2490178Z 2025-05-07T20:32:54.2490266Z if scale_ub is not None: 2025-05-07T20:32:54.2490413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2490545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2490618Z ) 2025-05-07T20:32:54.2490696Z else: 2025-05-07T20:32:54.2490786Z scale_ub_tensor = None 2025-05-07T20:32:54.2490857Z 2025-05-07T20:32:54.2490981Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2491065Z op = silu_mul_quant 2025-05-07T20:32:54.2491194Z if compiled: 2025-05-07T20:32:54.2491291Z op = torch.compile(op) 2025-05-07T20:32:54.2491396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2491466Z 2025-05-07T20:32:54.2491553Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2491558Z 2025-05-07T20:32:54.2491653Z moe/activation_test.py:117: 2025-05-07T20:32:54.2491780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2491876Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2492015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2492379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2492471Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2493012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2493110Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2493461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2493678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2494007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2494102Z kernel = self.compile( 2025-05-07T20:32:54.2494477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2494648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2494769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2494774Z 2025-05-07T20:32:54.2494975Z self = 2025-05-07T20:32:54.2495736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2496230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f232520>} 2025-05-07T20:32:54.2496965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2497153Z context = 2025-05-07T20:32:54.2497157Z 2025-05-07T20:32:54.2497316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2497572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2497679Z module_map=module_map) 2025-05-07T20:32:54.2497838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2497931Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2498005Z E ^ 2025-05-07T20:32:54.2498358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2498411Z 2025-05-07T20:32:54.2498852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2498862Z 2025-05-07T20:32:54.2498965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2499181Z self=, 2025-05-07T20:32:54.2499254Z T=1, 2025-05-07T20:32:54.2499328Z D=7168, 2025-05-07T20:32:54.2499403Z scale_ub=None, 2025-05-07T20:32:54.2499489Z contiguous=False, 2025-05-07T20:32:54.2499578Z compiled=False, 2025-05-07T20:32:54.2499689Z ) 2025-05-07T20:32:54.2499901Z self = 2025-05-07T20:32:54.2500066Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2500070Z 2025-05-07T20:32:54.2500145Z @given( 2025-05-07T20:32:54.2500264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2500362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2500472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2500631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2500743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2500816Z ) 2025-05-07T20:32:54.2501059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2501146Z def test_silu_mul_quant( 2025-05-07T20:32:54.2501219Z self, 2025-05-07T20:32:54.2501295Z T: int, 2025-05-07T20:32:54.2501372Z D: int, 2025-05-07T20:32:54.2501465Z scale_ub: Optional[float], 2025-05-07T20:32:54.2501554Z contiguous: bool, 2025-05-07T20:32:54.2501635Z compiled: bool, 2025-05-07T20:32:54.2501715Z ) -> None: 2025-05-07T20:32:54.2501805Z torch.manual_seed(2025) 2025-05-07T20:32:54.2501875Z 2025-05-07T20:32:54.2502041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2502116Z 2025-05-07T20:32:54.2502203Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2502336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2502422Z x = x_sign * x_clamp 2025-05-07T20:32:54.2502497Z x0 = x[:, :D] 2025-05-07T20:32:54.2502577Z x1 = x[:, D:] 2025-05-07T20:32:54.2502647Z 2025-05-07T20:32:54.2502724Z if contiguous: 2025-05-07T20:32:54.2502815Z x0 = x0.contiguous() 2025-05-07T20:32:54.2502901Z x1 = x1.contiguous() 2025-05-07T20:32:54.2502971Z 2025-05-07T20:32:54.2503063Z if scale_ub is not None: 2025-05-07T20:32:54.2503164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2503296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2503365Z ) 2025-05-07T20:32:54.2503438Z else: 2025-05-07T20:32:54.2503528Z scale_ub_tensor = None 2025-05-07T20:32:54.2503597Z 2025-05-07T20:32:54.2503726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2503813Z op = silu_mul_quant 2025-05-07T20:32:54.2503896Z if compiled: 2025-05-07T20:32:54.2503992Z op = torch.compile(op) 2025-05-07T20:32:54.2504094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2504163Z 2025-05-07T20:32:54.2504257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2504261Z 2025-05-07T20:32:54.2504353Z moe/activation_test.py:117: 2025-05-07T20:32:54.2504475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2504581Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2504676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2505160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2505257Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2505655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2505916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2506248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2506339Z kernel = self.compile( 2025-05-07T20:32:54.2506713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2506882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2507045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2507055Z 2025-05-07T20:32:54.2507258Z self = 2025-05-07T20:32:54.2508014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2508576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f233100>} 2025-05-07T20:32:54.2509304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2509496Z context = 2025-05-07T20:32:54.2509501Z 2025-05-07T20:32:54.2509659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2509911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2510019Z module_map=module_map) 2025-05-07T20:32:54.2510177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2510270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2510348Z E ^ 2025-05-07T20:32:54.2510696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2510701Z 2025-05-07T20:32:54.2511108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2511112Z 2025-05-07T20:32:54.2511212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2511430Z self=, 2025-05-07T20:32:54.2511505Z T=2048, 2025-05-07T20:32:54.2511577Z D=7168, 2025-05-07T20:32:54.2511658Z scale_ub=None, 2025-05-07T20:32:54.2511740Z contiguous=False, 2025-05-07T20:32:54.2511815Z compiled=True, 2025-05-07T20:32:54.2511884Z ) 2025-05-07T20:32:54.2512096Z self = 2025-05-07T20:32:54.2512265Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2512277Z 2025-05-07T20:32:54.2512353Z @given( 2025-05-07T20:32:54.2512467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2512563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2512675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2512787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2512900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2512975Z ) 2025-05-07T20:32:54.2513212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2513305Z def test_silu_mul_quant( 2025-05-07T20:32:54.2513378Z self, 2025-05-07T20:32:54.2513449Z T: int, 2025-05-07T20:32:54.2513525Z D: int, 2025-05-07T20:32:54.2513620Z scale_ub: Optional[float], 2025-05-07T20:32:54.2513785Z contiguous: bool, 2025-05-07T20:32:54.2513868Z compiled: bool, 2025-05-07T20:32:54.2513942Z ) -> None: 2025-05-07T20:32:54.2514076Z torch.manual_seed(2025) 2025-05-07T20:32:54.2514147Z 2025-05-07T20:32:54.2514311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2514388Z 2025-05-07T20:32:54.2514480Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2514600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2514688Z x = x_sign * x_clamp 2025-05-07T20:32:54.2514805Z x0 = x[:, :D] 2025-05-07T20:32:54.2514879Z x1 = x[:, D:] 2025-05-07T20:32:54.2514947Z 2025-05-07T20:32:54.2515030Z if contiguous: 2025-05-07T20:32:54.2515116Z x0 = x0.contiguous() 2025-05-07T20:32:54.2515205Z x1 = x1.contiguous() 2025-05-07T20:32:54.2515272Z 2025-05-07T20:32:54.2515359Z if scale_ub is not None: 2025-05-07T20:32:54.2515468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2515596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2515670Z ) 2025-05-07T20:32:54.2515788Z else: 2025-05-07T20:32:54.2515879Z scale_ub_tensor = None 2025-05-07T20:32:54.2515955Z 2025-05-07T20:32:54.2516082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2516169Z op = silu_mul_quant 2025-05-07T20:32:54.2516254Z if compiled: 2025-05-07T20:32:54.2516355Z op = torch.compile(op) 2025-05-07T20:32:54.2516461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2516529Z 2025-05-07T20:32:54.2516617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2516622Z 2025-05-07T20:32:54.2516712Z moe/activation_test.py:117: 2025-05-07T20:32:54.2516838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2516936Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2517041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2517406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2517496Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2517980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2518072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2518418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2518642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2518971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2519063Z kernel = self.compile( 2025-05-07T20:32:54.2519434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2519609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2519739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2519744Z 2025-05-07T20:32:54.2519947Z self = 2025-05-07T20:32:54.2520709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2521208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a0f8720>} 2025-05-07T20:32:54.2521935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2522211Z context = 2025-05-07T20:32:54.2522216Z 2025-05-07T20:32:54.2522378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2522635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2522738Z module_map=module_map) 2025-05-07T20:32:54.2522896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2523118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2523190Z E ^ 2025-05-07T20:32:54.2523534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2523543Z 2025-05-07T20:32:54.2523945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2523952Z 2025-05-07T20:32:54.2524050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2524312Z self=, 2025-05-07T20:32:54.2524386Z T=4096, 2025-05-07T20:32:54.2524461Z D=7168, 2025-05-07T20:32:54.2524543Z scale_ub=None, 2025-05-07T20:32:54.2524631Z contiguous=False, 2025-05-07T20:32:54.2524710Z compiled=True, 2025-05-07T20:32:54.2524787Z ) 2025-05-07T20:32:54.2525000Z self = 2025-05-07T20:32:54.2525176Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2525180Z 2025-05-07T20:32:54.2525251Z @given( 2025-05-07T20:32:54.2525367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2525470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2525579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2525693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2525805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2525880Z ) 2025-05-07T20:32:54.2526125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2526214Z def test_silu_mul_quant( 2025-05-07T20:32:54.2526288Z self, 2025-05-07T20:32:54.2526367Z T: int, 2025-05-07T20:32:54.2526438Z D: int, 2025-05-07T20:32:54.2526534Z scale_ub: Optional[float], 2025-05-07T20:32:54.2526623Z contiguous: bool, 2025-05-07T20:32:54.2526708Z compiled: bool, 2025-05-07T20:32:54.2526781Z ) -> None: 2025-05-07T20:32:54.2526879Z torch.manual_seed(2025) 2025-05-07T20:32:54.2526948Z 2025-05-07T20:32:54.2527112Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2527182Z 2025-05-07T20:32:54.2527270Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2527393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2527481Z x = x_sign * x_clamp 2025-05-07T20:32:54.2527556Z x0 = x[:, :D] 2025-05-07T20:32:54.2527639Z x1 = x[:, D:] 2025-05-07T20:32:54.2527709Z 2025-05-07T20:32:54.2527788Z if contiguous: 2025-05-07T20:32:54.2527876Z x0 = x0.contiguous() 2025-05-07T20:32:54.2527959Z x1 = x1.contiguous() 2025-05-07T20:32:54.2528026Z 2025-05-07T20:32:54.2528113Z if scale_ub is not None: 2025-05-07T20:32:54.2528214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2528344Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2528421Z ) 2025-05-07T20:32:54.2528492Z else: 2025-05-07T20:32:54.2528582Z scale_ub_tensor = None 2025-05-07T20:32:54.2528654Z 2025-05-07T20:32:54.2528778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2528865Z op = silu_mul_quant 2025-05-07T20:32:54.2528994Z if compiled: 2025-05-07T20:32:54.2529089Z op = torch.compile(op) 2025-05-07T20:32:54.2529234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2529302Z 2025-05-07T20:32:54.2529389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2529393Z 2025-05-07T20:32:54.2529489Z moe/activation_test.py:117: 2025-05-07T20:32:54.2529613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2529710Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2529808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2530206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2530296Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2530774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2530870Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2531219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2531478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2531808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2531900Z kernel = self.compile( 2025-05-07T20:32:54.2532270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2532450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2532574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2532578Z 2025-05-07T20:32:54.2532779Z self = 2025-05-07T20:32:54.2533652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2534150Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a0f9440>} 2025-05-07T20:32:54.2534880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2535070Z context = 2025-05-07T20:32:54.2535074Z 2025-05-07T20:32:54.2535238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2535493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2535598Z module_map=module_map) 2025-05-07T20:32:54.2535763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2535861Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2535935Z E ^ 2025-05-07T20:32:54.2536284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2536289Z 2025-05-07T20:32:54.2536696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2536703Z 2025-05-07T20:32:54.2536803Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2537018Z self=, 2025-05-07T20:32:54.2537092Z T=16384, 2025-05-07T20:32:54.2537165Z D=5120, 2025-05-07T20:32:54.2537241Z scale_ub=1200.0, 2025-05-07T20:32:54.2537324Z contiguous=False, 2025-05-07T20:32:54.2537404Z compiled=False, 2025-05-07T20:32:54.2537520Z ) 2025-05-07T20:32:54.2537729Z self = 2025-05-07T20:32:54.2537951Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2537956Z 2025-05-07T20:32:54.2538031Z @given( 2025-05-07T20:32:54.2538149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2538244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2538355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2538473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2538646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2538719Z ) 2025-05-07T20:32:54.2538964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2539050Z def test_silu_mul_quant( 2025-05-07T20:32:54.2539123Z self, 2025-05-07T20:32:54.2539195Z T: int, 2025-05-07T20:32:54.2539267Z D: int, 2025-05-07T20:32:54.2539364Z scale_ub: Optional[float], 2025-05-07T20:32:54.2539450Z contiguous: bool, 2025-05-07T20:32:54.2539572Z compiled: bool, 2025-05-07T20:32:54.2539650Z ) -> None: 2025-05-07T20:32:54.2539743Z torch.manual_seed(2025) 2025-05-07T20:32:54.2539812Z 2025-05-07T20:32:54.2539978Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2540046Z 2025-05-07T20:32:54.2540132Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2540254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2540343Z x = x_sign * x_clamp 2025-05-07T20:32:54.2540419Z x0 = x[:, :D] 2025-05-07T20:32:54.2540500Z x1 = x[:, D:] 2025-05-07T20:32:54.2540569Z 2025-05-07T20:32:54.2540653Z if contiguous: 2025-05-07T20:32:54.2540741Z x0 = x0.contiguous() 2025-05-07T20:32:54.2540826Z x1 = x1.contiguous() 2025-05-07T20:32:54.2540900Z 2025-05-07T20:32:54.2540990Z if scale_ub is not None: 2025-05-07T20:32:54.2541091Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2541231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2541301Z ) 2025-05-07T20:32:54.2541372Z else: 2025-05-07T20:32:54.2541466Z scale_ub_tensor = None 2025-05-07T20:32:54.2541538Z 2025-05-07T20:32:54.2541662Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2541751Z op = silu_mul_quant 2025-05-07T20:32:54.2541830Z if compiled: 2025-05-07T20:32:54.2541931Z op = torch.compile(op) 2025-05-07T20:32:54.2542030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2542101Z 2025-05-07T20:32:54.2542191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2542195Z 2025-05-07T20:32:54.2542288Z moe/activation_test.py:117: 2025-05-07T20:32:54.2542412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2542518Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2542612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2543146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2546448Z         _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fc88a0fa340>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:54.2552459Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[frame locals identical to the previous failure, modulo object addresses]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
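Every Hypothesis example fails before the kernel ever runs: the Triton frontend rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant. Triton only lowers fp8e4nv on GPUs with compute capability 8.9 or newer (Ada, Hopper); the A10G backing this g5 runner is SM 8.6, which is why only 'fp8e4b15' and 'fp8e5' are offered. No drawn value of T, D, scale_ub, contiguous, or compiled can change the device, so one fix is to gate the test on capability. A minimal sketch, assuming a unittest-style TestCase; the helper supports_fp8e4nv and the class name are hypothetical, not code from activation_test.py:

    # Hypothetical guard, not from the FBGEMM sources: skip FP8-e4m3 tests on
    # GPUs that predate SM 8.9, where Triton cannot lower tl.float8e4nv.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True when the current CUDA device can compile fp8e4nv (e4m3) kernels."""
        if not torch.cuda.is_available():
            return False
        # Ada (SM 8.9) and Hopper (SM 9.0) introduced hardware FP8 support.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as shown above

With such a gate the job would report a skip on this runner instead of burning every Hypothesis draw on the same compile error.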
Hypothesis then drew six more compiled=True examples. Each re-ran the identical test body and failed with the identical traceback and CompilationError:

2025-05-07T20:32:54.2565548Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.2578412Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.2591178Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2603856Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:54.2616647Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2629476Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
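That the failure is architecture-bound rather than input-bound can be confirmed without FBGEMM at all: a bare cast to tl.float8e4nv trips the same frontend check. A minimal sketch, assuming Triton 3.x on a pre-SM 8.9 CUDA device; the kernel and tensor names are illustrative:

    # Hypothetical standalone repro, not from the log: casting to fp8e4nv on a
    # pre-SM 8.9 GPU raises the same CompilationError at kernel-compile time.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        offs = tl.arange(0, 16)
        x = tl.load(x_ptr + offs)
        # The .to(tl.float8e4nv) below is what the frontend rejects on SM 8.6.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # CompilationError: "type fp8e4nv not supported..."

On an SM 8.9+ device the same launch should compile and fill y.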
The remaining draws failed identically as well; for the two compiled=False draws the torch/_dynamo/eval_frame.py frame is absent from the chain, but it still ends in the Triton frontend:

2025-05-07T20:32:54.2642153Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:54.2654649Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.2670452Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2683252Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2696176Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Every retry ends at the same frontend check:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2708313Z 2025-05-07T20:32:54.2708717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2708722Z 2025-05-07T20:32:54.2708863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2709079Z self=, 2025-05-07T20:32:54.2709192Z T=16384, 2025-05-07T20:32:54.2709268Z D=5120, 2025-05-07T20:32:54.2709344Z scale_ub=None, 2025-05-07T20:32:54.2709429Z contiguous=False, 2025-05-07T20:32:54.2709510Z compiled=False, 2025-05-07T20:32:54.2709581Z ) 2025-05-07T20:32:54.2709795Z self = 2025-05-07T20:32:54.2709968Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2710013Z 2025-05-07T20:32:54.2710087Z @given( 2025-05-07T20:32:54.2710203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2710299Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2710413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2710528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2710639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2710712Z ) 2025-05-07T20:32:54.2710997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2711091Z def test_silu_mul_quant( 2025-05-07T20:32:54.2711166Z self, 2025-05-07T20:32:54.2711240Z T: int, 2025-05-07T20:32:54.2711317Z D: int, 2025-05-07T20:32:54.2711413Z scale_ub: Optional[float], 2025-05-07T20:32:54.2711500Z contiguous: bool, 2025-05-07T20:32:54.2711582Z compiled: bool, 2025-05-07T20:32:54.2711656Z ) -> None: 2025-05-07T20:32:54.2711751Z torch.manual_seed(2025) 2025-05-07T20:32:54.2711820Z 2025-05-07T20:32:54.2711984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2712056Z 2025-05-07T20:32:54.2712147Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2712266Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2714059Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
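The CompilationError repeated above is Triton refusing the fp8e4nv dtype (torch.float8_e4m3fn) on this runner's GPU. The 22.07 GiB capacity in the OOM reports is consistent with an NVIDIA A10G, which is compute capability sm_86, and Triton's NVIDIA backend only emits fp8e4nv on sm_89 and newer hardware, which is why the error offers only 'fp8e4b15' and 'fp8e5'. A minimal capability guard, sketched here on the assumption that torch is importable on the worker:

    import torch

    # Sketch: detect pre-sm_89 GPUs (such as the A10G apparently on this
    # runner) before exercising fp8e4nv paths; Triton accepts fp8e4nv
    # only on sm_89 and newer.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        print(f"sm_{major}{minor}: fp8e4nv unsupported here, skip fp8 tests")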
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2714067Z 2025-05-07T20:32:54.2714184Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2714189Z 2025-05-07T20:32:54.2714290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2714505Z self=, 2025-05-07T20:32:54.2714578Z T=4096, 2025-05-07T20:32:54.2714654Z D=7168, 2025-05-07T20:32:54.2714734Z scale_ub=1200.0, 2025-05-07T20:32:54.2714817Z contiguous=True, 2025-05-07T20:32:54.2714897Z compiled=True, 2025-05-07T20:32:54.2714967Z ) 2025-05-07T20:32:54.2715183Z self = 2025-05-07T20:32:54.2715348Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2715353Z 2025-05-07T20:32:54.2715423Z @given( 2025-05-07T20:32:54.2715543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2715640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2715750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2715865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2715972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2716044Z ) 2025-05-07T20:32:54.2716285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2716374Z def test_silu_mul_quant( 2025-05-07T20:32:54.2716497Z self, 2025-05-07T20:32:54.2716572Z T: int, 2025-05-07T20:32:54.2716646Z D: int, 2025-05-07T20:32:54.2716742Z scale_ub: Optional[float], 2025-05-07T20:32:54.2716867Z contiguous: bool, 2025-05-07T20:32:54.2716951Z compiled: bool, 2025-05-07T20:32:54.2717027Z ) -> None: 2025-05-07T20:32:54.2717117Z torch.manual_seed(2025) 2025-05-07T20:32:54.2717187Z 2025-05-07T20:32:54.2717353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2717425Z 2025-05-07T20:32:54.2717516Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2717681Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2719473Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2719487Z 2025-05-07T20:32:54.2719601Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2719605Z 2025-05-07T20:32:54.2719703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2719918Z self=, 2025-05-07T20:32:54.2719996Z T=16384, 2025-05-07T20:32:54.2720069Z D=7168, 2025-05-07T20:32:54.2720147Z scale_ub=None, 2025-05-07T20:32:54.2720231Z contiguous=False, 2025-05-07T20:32:54.2720309Z compiled=False, 2025-05-07T20:32:54.2720380Z ) 2025-05-07T20:32:54.2720590Z self = 2025-05-07T20:32:54.2720764Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2720771Z 2025-05-07T20:32:54.2720845Z @given( 2025-05-07T20:32:54.2720960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2721059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2721168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2721276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2721386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2721459Z ) 2025-05-07T20:32:54.2721699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2721791Z def test_silu_mul_quant( 2025-05-07T20:32:54.2721862Z self, 2025-05-07T20:32:54.2721938Z T: int, 2025-05-07T20:32:54.2722010Z D: int, 2025-05-07T20:32:54.2722104Z scale_ub: Optional[float], 2025-05-07T20:32:54.2722193Z contiguous: bool, 2025-05-07T20:32:54.2722273Z compiled: bool, 2025-05-07T20:32:54.2722349Z ) -> None: 2025-05-07T20:32:54.2722442Z torch.manual_seed(2025) 2025-05-07T20:32:54.2722509Z 2025-05-07T20:32:54.2722673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2724425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
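The OOM allocation sizes match the test's input tensor exactly: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes. A quick back-of-the-envelope check (hypothetical helper, not part of the test file):

    # Size of a [T, 2*D] bfloat16 tensor, in MiB.
    def input_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    print(input_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
    print(input_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(input_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"

So the failures above are simply the largest Hypothesis examples exhausting what little headroom is left on the device.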
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2724434Z 2025-05-07T20:32:54.2724547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2724551Z 2025-05-07T20:32:54.2724653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2724939Z self=, 2025-05-07T20:32:54.2725015Z T=2048, 2025-05-07T20:32:54.2725130Z D=7168, 2025-05-07T20:32:54.2725212Z scale_ub=1200.0, 2025-05-07T20:32:54.2725295Z contiguous=True, 2025-05-07T20:32:54.2725373Z compiled=True, 2025-05-07T20:32:54.2725444Z ) 2025-05-07T20:32:54.2725656Z self = 2025-05-07T20:32:54.2725821Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2725865Z 2025-05-07T20:32:54.2725936Z @given( 2025-05-07T20:32:54.2726053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2726149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2726262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2726373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2726481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2726564Z ) 2025-05-07T20:32:54.2726810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2726940Z def test_silu_mul_quant( 2025-05-07T20:32:54.2727019Z self, 2025-05-07T20:32:54.2727093Z T: int, 2025-05-07T20:32:54.2727164Z D: int, 2025-05-07T20:32:54.2727262Z scale_ub: Optional[float], 2025-05-07T20:32:54.2727346Z contiguous: bool, 2025-05-07T20:32:54.2727426Z compiled: bool, 2025-05-07T20:32:54.2727503Z ) -> None: 2025-05-07T20:32:54.2727597Z torch.manual_seed(2025) 2025-05-07T20:32:54.2727663Z 2025-05-07T20:32:54.2727827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2727899Z 2025-05-07T20:32:54.2727988Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2728112Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2729852Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2729864Z 2025-05-07T20:32:54.2729976Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2729984Z 2025-05-07T20:32:54.2730082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2730303Z self=, 2025-05-07T20:32:54.2730375Z T=2048, 2025-05-07T20:32:54.2730450Z D=7168, 2025-05-07T20:32:54.2730527Z scale_ub=None, 2025-05-07T20:32:54.2730607Z contiguous=True, 2025-05-07T20:32:54.2730688Z compiled=False, 2025-05-07T20:32:54.2730763Z ) 2025-05-07T20:32:54.2730974Z self = 2025-05-07T20:32:54.2731145Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2731149Z 2025-05-07T20:32:54.2731221Z @given( 2025-05-07T20:32:54.2731334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2731431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2731538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2731652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2731763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2731834Z ) 2025-05-07T20:32:54.2732072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2732163Z def test_silu_mul_quant( 2025-05-07T20:32:54.2732232Z self, 2025-05-07T20:32:54.2732354Z T: int, 2025-05-07T20:32:54.2732428Z D: int, 2025-05-07T20:32:54.2732520Z scale_ub: Optional[float], 2025-05-07T20:32:54.2732647Z contiguous: bool, 2025-05-07T20:32:54.2732732Z compiled: bool, 2025-05-07T20:32:54.2732806Z ) -> None: 2025-05-07T20:32:54.2732900Z torch.manual_seed(2025) 2025-05-07T20:32:54.2733048Z 2025-05-07T20:32:54.2733235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2733311Z 2025-05-07T20:32:54.2733401Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2735143Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
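The allocator hint in the message is an environment knob that must be set before the process makes its first CUDA allocation, so exporting it in the job step or at the top of a conftest.py (an assumption about where this suite would pick it up) is one option:

    import os

    # Read lazily by PyTorch's caching allocator, so it must be in the
    # environment before the first CUDA tensor is created in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

expandable_segments mitigates fragmentation-driven OOMs, but it cannot help if earlier examples genuinely keep ~21.7 GiB allocated, as the reports below suggest.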
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2735197Z 2025-05-07T20:32:54.2735347Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2735353Z 2025-05-07T20:32:54.2735450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2735669Z self=, 2025-05-07T20:32:54.2735739Z T=1, 2025-05-07T20:32:54.2735811Z D=7168, 2025-05-07T20:32:54.2735889Z scale_ub=1200.0, 2025-05-07T20:32:54.2735971Z contiguous=True, 2025-05-07T20:32:54.2736056Z compiled=False, 2025-05-07T20:32:54.2736124Z ) 2025-05-07T20:32:54.2736334Z self = 2025-05-07T20:32:54.2736496Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2736500Z 2025-05-07T20:32:54.2736572Z @given( 2025-05-07T20:32:54.2736689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2736786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2736899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2737013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2737121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2737188Z ) 2025-05-07T20:32:54.2737432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2737520Z def test_silu_mul_quant( 2025-05-07T20:32:54.2737596Z self, 2025-05-07T20:32:54.2737675Z T: int, 2025-05-07T20:32:54.2737746Z D: int, 2025-05-07T20:32:54.2737837Z scale_ub: Optional[float], 2025-05-07T20:32:54.2737925Z contiguous: bool, 2025-05-07T20:32:54.2738005Z compiled: bool, 2025-05-07T20:32:54.2738078Z ) -> None: 2025-05-07T20:32:54.2738174Z torch.manual_seed(2025) 2025-05-07T20:32:54.2738244Z 2025-05-07T20:32:54.2738412Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2738483Z 2025-05-07T20:32:54.2738573Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2738701Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2738786Z x = x_sign * x_clamp 2025-05-07T20:32:54.2738860Z x0 = x[:, :D] 2025-05-07T20:32:54.2738938Z x1 = x[:, D:] 2025-05-07T20:32:54.2739010Z 2025-05-07T20:32:54.2739089Z if contiguous: 2025-05-07T20:32:54.2739180Z x0 = x0.contiguous() 2025-05-07T20:32:54.2739268Z x1 = x1.contiguous() 2025-05-07T20:32:54.2739336Z 2025-05-07T20:32:54.2739426Z if scale_ub is not None: 2025-05-07T20:32:54.2739527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2739658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2739730Z ) 2025-05-07T20:32:54.2739801Z else: 2025-05-07T20:32:54.2739942Z scale_ub_tensor = None 2025-05-07T20:32:54.2740008Z 2025-05-07T20:32:54.2740133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2740268Z op = silu_mul_quant 2025-05-07T20:32:54.2740349Z if compiled: 2025-05-07T20:32:54.2740444Z op = torch.compile(op) 2025-05-07T20:32:54.2740548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2740615Z 2025-05-07T20:32:54.2740700Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2740704Z 2025-05-07T20:32:54.2740802Z moe/activation_test.py:117: 2025-05-07T20:32:54.2740966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2741064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2741159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2741650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2741748Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2742108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2742363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2742700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2742789Z kernel = self.compile( 2025-05-07T20:32:54.2743163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2743336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2743459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2743463Z 2025-05-07T20:32:54.2743668Z self = 2025-05-07T20:32:54.2744431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2744928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca2520>} 2025-05-07T20:32:54.2745657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2745852Z context = 2025-05-07T20:32:54.2745857Z 2025-05-07T20:32:54.2746019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2746275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2746382Z module_map=module_map) 2025-05-07T20:32:54.2746538Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2746634Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2746712Z E ^ 2025-05-07T20:32:54.2747057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2747062Z 2025-05-07T20:32:54.2747467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2747474Z 2025-05-07T20:32:54.2747572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2747787Z self=, 2025-05-07T20:32:54.2747860Z T=128, 2025-05-07T20:32:54.2747932Z D=5120, 2025-05-07T20:32:54.2748012Z scale_ub=None, 2025-05-07T20:32:54.2748097Z contiguous=True, 2025-05-07T20:32:54.2748178Z compiled=False, 2025-05-07T20:32:54.2748293Z ) 2025-05-07T20:32:54.2748509Z self = 2025-05-07T20:32:54.2748714Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2748719Z 2025-05-07T20:32:54.2748799Z @given( 2025-05-07T20:32:54.2748914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2749008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2749122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2749234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2749383Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2749455Z ) 2025-05-07T20:32:54.2749693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2749784Z def test_silu_mul_quant( 2025-05-07T20:32:54.2749856Z self, 2025-05-07T20:32:54.2749930Z T: int, 2025-05-07T20:32:54.2750008Z D: int, 2025-05-07T20:32:54.2750101Z scale_ub: Optional[float], 2025-05-07T20:32:54.2750186Z contiguous: bool, 2025-05-07T20:32:54.2750333Z compiled: bool, 2025-05-07T20:32:54.2750410Z ) -> None: 2025-05-07T20:32:54.2750499Z torch.manual_seed(2025) 2025-05-07T20:32:54.2750571Z 2025-05-07T20:32:54.2750733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2750801Z 2025-05-07T20:32:54.2750893Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2751015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2751108Z x = x_sign * x_clamp 2025-05-07T20:32:54.2751185Z x0 = x[:, :D] 2025-05-07T20:32:54.2751259Z x1 = x[:, D:] 2025-05-07T20:32:54.2751330Z 2025-05-07T20:32:54.2751408Z if contiguous: 2025-05-07T20:32:54.2751493Z x0 = x0.contiguous() 2025-05-07T20:32:54.2751581Z x1 = x1.contiguous() 2025-05-07T20:32:54.2751649Z 2025-05-07T20:32:54.2751736Z if scale_ub is not None: 2025-05-07T20:32:54.2751840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2751973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2752044Z ) 2025-05-07T20:32:54.2752121Z else: 2025-05-07T20:32:54.2752211Z scale_ub_tensor = None 2025-05-07T20:32:54.2752282Z 2025-05-07T20:32:54.2752409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2752496Z op = silu_mul_quant 2025-05-07T20:32:54.2752580Z if compiled: 2025-05-07T20:32:54.2752679Z op = torch.compile(op) 2025-05-07T20:32:54.2752780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2752852Z 2025-05-07T20:32:54.2752950Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2752956Z 2025-05-07T20:32:54.2753057Z moe/activation_test.py:117: 2025-05-07T20:32:54.2753211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2753311Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2753407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2753903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2753997Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2754350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2754565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2754897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2754988Z kernel = self.compile( 2025-05-07T20:32:54.2755360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2755530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2755703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2755747Z 2025-05-07T20:32:54.2755953Z self = 2025-05-07T20:32:54.2756712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2757206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca3420>} 2025-05-07T20:32:54.2757980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2758168Z context = 2025-05-07T20:32:54.2758172Z 2025-05-07T20:32:54.2758370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2758626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2758728Z module_map=module_map) 2025-05-07T20:32:54.2758885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2758980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2759053Z E ^ 2025-05-07T20:32:54.2759653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2759658Z 2025-05-07T20:32:54.2760065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2760070Z 2025-05-07T20:32:54.2760169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2760388Z self=, 2025-05-07T20:32:54.2760459Z T=128, 2025-05-07T20:32:54.2760540Z D=7168, 2025-05-07T20:32:54.2760619Z scale_ub=None, 2025-05-07T20:32:54.2760697Z contiguous=True, 2025-05-07T20:32:54.2760781Z compiled=False, 2025-05-07T20:32:54.2760848Z ) 2025-05-07T20:32:54.2761059Z self = 2025-05-07T20:32:54.2761227Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2761234Z 2025-05-07T20:32:54.2761306Z @given( 2025-05-07T20:32:54.2761428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2761525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2761634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2761748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2761858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2761929Z ) 2025-05-07T20:32:54.2762175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2762267Z def test_silu_mul_quant( 2025-05-07T20:32:54.2762337Z self, 2025-05-07T20:32:54.2762413Z T: int, 2025-05-07T20:32:54.2762485Z D: int, 2025-05-07T20:32:54.2762578Z scale_ub: Optional[float], 2025-05-07T20:32:54.2762667Z contiguous: bool, 2025-05-07T20:32:54.2762750Z compiled: bool, 2025-05-07T20:32:54.2762828Z ) -> None: 2025-05-07T20:32:54.2762924Z torch.manual_seed(2025) 2025-05-07T20:32:54.2762994Z 2025-05-07T20:32:54.2763161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2763231Z 2025-05-07T20:32:54.2763324Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2763448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2763532Z x = x_sign * x_clamp 2025-05-07T20:32:54.2763686Z x0 = x[:, :D] 2025-05-07T20:32:54.2763769Z x1 = x[:, D:] 2025-05-07T20:32:54.2763837Z 2025-05-07T20:32:54.2763973Z if contiguous: 2025-05-07T20:32:54.2764066Z x0 = x0.contiguous() 2025-05-07T20:32:54.2764152Z x1 = x1.contiguous() 2025-05-07T20:32:54.2764222Z 2025-05-07T20:32:54.2764309Z if scale_ub is not None: 2025-05-07T20:32:54.2764409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2764541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2764609Z ) 2025-05-07T20:32:54.2764742Z else: 2025-05-07T20:32:54.2764834Z scale_ub_tensor = None 2025-05-07T20:32:54.2764904Z 2025-05-07T20:32:54.2765029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2765118Z op = silu_mul_quant 2025-05-07T20:32:54.2765196Z if compiled: 2025-05-07T20:32:54.2765290Z op = torch.compile(op) 2025-05-07T20:32:54.2765396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2765465Z 2025-05-07T20:32:54.2765551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2765562Z 2025-05-07T20:32:54.2765709Z moe/activation_test.py:117: 2025-05-07T20:32:54.2765837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2765935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2766030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2766518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2766616Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2766965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2767184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2767514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2767607Z kernel = self.compile( 2025-05-07T20:32:54.2767986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2768156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2768277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2768282Z 2025-05-07T20:32:54.2768484Z self = 2025-05-07T20:32:54.2769242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2769736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea944a0>} 2025-05-07T20:32:54.2770468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2770658Z context = 2025-05-07T20:32:54.2770662Z 2025-05-07T20:32:54.2770822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2771073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2771181Z module_map=module_map) 2025-05-07T20:32:54.2771338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2771434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2771509Z E ^ 2025-05-07T20:32:54.2771854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2771903Z 2025-05-07T20:32:54.2772350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2772356Z 2025-05-07T20:32:54.2772455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2772672Z self=, 2025-05-07T20:32:54.2772743Z T=2048, 2025-05-07T20:32:54.2772817Z D=7168, 2025-05-07T20:32:54.2772898Z scale_ub=1200.0, 2025-05-07T20:32:54.2773078Z contiguous=True, 2025-05-07T20:32:54.2773160Z compiled=False, 2025-05-07T20:32:54.2773233Z ) 2025-05-07T20:32:54.2773447Z self = 2025-05-07T20:32:54.2773617Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2773622Z 2025-05-07T20:32:54.2773698Z @given( 2025-05-07T20:32:54.2773816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2773912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2774070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2774184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2774300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2774368Z ) 2025-05-07T20:32:54.2774606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2774699Z def test_silu_mul_quant( 2025-05-07T20:32:54.2774776Z self, 2025-05-07T20:32:54.2774848Z T: int, 2025-05-07T20:32:54.2774927Z D: int, 2025-05-07T20:32:54.2775021Z scale_ub: Optional[float], 2025-05-07T20:32:54.2775107Z contiguous: bool, 2025-05-07T20:32:54.2775192Z compiled: bool, 2025-05-07T20:32:54.2775266Z ) -> None: 2025-05-07T20:32:54.2775357Z torch.manual_seed(2025) 2025-05-07T20:32:54.2775430Z 2025-05-07T20:32:54.2775593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2777346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
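Note that the compiled=False examples fail with the same ValueError as the compiled=True ones: silu_mul_quant launches the Triton kernel directly, so torch.compile only adds the _dynamo eval_frame hop seen in the earlier tracebacks. Roughly, with frames abbreviated from the tracebacks above:

    # compiled=True : test fn -> torch._dynamo.eval_frame._fn -> silu_mul_quant
    #                 -> _fbgemm_silu_mul_quant[grid] -> Triton compile -> ValueError
    # compiled=False: test fn -> silu_mul_quant
    #                 -> _fbgemm_silu_mul_quant[grid] -> Triton compile -> ValueError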
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2777354Z 2025-05-07T20:32:54.2777467Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2777472Z 2025-05-07T20:32:54.2777573Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2777789Z self=, 2025-05-07T20:32:54.2777863Z T=1, 2025-05-07T20:32:54.2777940Z D=5120, 2025-05-07T20:32:54.2778020Z scale_ub=1200.0, 2025-05-07T20:32:54.2778104Z contiguous=True, 2025-05-07T20:32:54.2778187Z compiled=False, 2025-05-07T20:32:54.2778257Z ) 2025-05-07T20:32:54.2778466Z self = 2025-05-07T20:32:54.2778629Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2778634Z 2025-05-07T20:32:54.2778709Z @given( 2025-05-07T20:32:54.2778824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2778924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2779033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2779148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2779257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2779328Z ) 2025-05-07T20:32:54.2779568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2779751Z def test_silu_mul_quant( 2025-05-07T20:32:54.2779827Z self, 2025-05-07T20:32:54.2779944Z T: int, 2025-05-07T20:32:54.2780023Z D: int, 2025-05-07T20:32:54.2780120Z scale_ub: Optional[float], 2025-05-07T20:32:54.2780206Z contiguous: bool, 2025-05-07T20:32:54.2780288Z compiled: bool, 2025-05-07T20:32:54.2780367Z ) -> None: 2025-05-07T20:32:54.2780462Z torch.manual_seed(2025) 2025-05-07T20:32:54.2780530Z 2025-05-07T20:32:54.2780695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2780834Z 2025-05-07T20:32:54.2780923Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2781048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2781133Z x = x_sign * x_clamp 2025-05-07T20:32:54.2781209Z x0 = x[:, :D] 2025-05-07T20:32:54.2781288Z x1 = x[:, D:] 2025-05-07T20:32:54.2781365Z 2025-05-07T20:32:54.2781445Z if contiguous: 2025-05-07T20:32:54.2781535Z x0 = x0.contiguous() 2025-05-07T20:32:54.2781624Z x1 = x1.contiguous() 2025-05-07T20:32:54.2781732Z 2025-05-07T20:32:54.2781820Z if scale_ub is not None: 2025-05-07T20:32:54.2781922Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2782054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2782125Z ) 2025-05-07T20:32:54.2782196Z else: 2025-05-07T20:32:54.2782291Z scale_ub_tensor = None 2025-05-07T20:32:54.2782365Z 2025-05-07T20:32:54.2782491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2782585Z op = silu_mul_quant 2025-05-07T20:32:54.2782671Z if compiled: 2025-05-07T20:32:54.2782767Z op = torch.compile(op) 2025-05-07T20:32:54.2782877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2786052Z 2025-05-07T20:32:54.2786160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2786165Z 2025-05-07T20:32:54.2786268Z moe/activation_test.py:117: 2025-05-07T20:32:54.2786405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2786508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2786607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2787108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2787208Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2787563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2787782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2788124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2788215Z kernel = self.compile( 2025-05-07T20:32:54.2788596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2788769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2788894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2788899Z 2025-05-07T20:32:54.2789107Z self = 2025-05-07T20:32:54.2789869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2790370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea95a80>} 2025-05-07T20:32:54.2791166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2791400Z context = 2025-05-07T20:32:54.2791405Z 2025-05-07T20:32:54.2791566Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2791824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2791932Z module_map=module_map) 2025-05-07T20:32:54.2792132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2792229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2792306Z E ^ 2025-05-07T20:32:54.2792651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2792656Z 2025-05-07T20:32:54.2793063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2793067Z 2025-05-07T20:32:54.2793206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2793428Z self=, 2025-05-07T20:32:54.2793505Z T=2048, 2025-05-07T20:32:54.2793577Z D=5120, 2025-05-07T20:32:54.2793655Z scale_ub=None, 2025-05-07T20:32:54.2793739Z contiguous=True, 2025-05-07T20:32:54.2793820Z compiled=False, 2025-05-07T20:32:54.2793889Z ) 2025-05-07T20:32:54.2794109Z self = 2025-05-07T20:32:54.2794276Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2794281Z 2025-05-07T20:32:54.2794360Z @given( 2025-05-07T20:32:54.2794477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2794573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2794693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2794805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2794920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2794999Z ) 2025-05-07T20:32:54.2795240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2795333Z def test_silu_mul_quant( 2025-05-07T20:32:54.2795404Z self, 2025-05-07T20:32:54.2795475Z T: int, 2025-05-07T20:32:54.2795552Z D: int, 2025-05-07T20:32:54.2795650Z scale_ub: Optional[float], 2025-05-07T20:32:54.2795735Z contiguous: bool, 2025-05-07T20:32:54.2795820Z compiled: bool, 2025-05-07T20:32:54.2795899Z ) -> None: 2025-05-07T20:32:54.2795990Z torch.manual_seed(2025) 2025-05-07T20:32:54.2796060Z 2025-05-07T20:32:54.2796224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2796299Z 2025-05-07T20:32:54.2796387Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2798145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2798154Z 2025-05-07T20:32:54.2798270Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2798275Z 2025-05-07T20:32:54.2798371Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2798591Z self=, 2025-05-07T20:32:54.2798665Z T=16384, 2025-05-07T20:32:54.2798783Z D=5120, 2025-05-07T20:32:54.2798862Z scale_ub=None, 2025-05-07T20:32:54.2798944Z contiguous=True, 2025-05-07T20:32:54.2799062Z compiled=False, 2025-05-07T20:32:54.2799134Z ) 2025-05-07T20:32:54.2799344Z self = 2025-05-07T20:32:54.2799515Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2799519Z 2025-05-07T20:32:54.2799594Z @given( 2025-05-07T20:32:54.2799709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2799850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2799958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2800070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2800181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2800255Z ) 2025-05-07T20:32:54.2800493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2800587Z def test_silu_mul_quant( 2025-05-07T20:32:54.2800659Z self, 2025-05-07T20:32:54.2800733Z T: int, 2025-05-07T20:32:54.2800856Z D: int, 2025-05-07T20:32:54.2800952Z scale_ub: Optional[float], 2025-05-07T20:32:54.2801042Z contiguous: bool, 2025-05-07T20:32:54.2801123Z compiled: bool, 2025-05-07T20:32:54.2801194Z ) -> None: 2025-05-07T20:32:54.2801288Z torch.manual_seed(2025) 2025-05-07T20:32:54.2801355Z 2025-05-07T20:32:54.2801518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2803268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2803277Z 2025-05-07T20:32:54.2803388Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2803392Z 2025-05-07T20:32:54.2803492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2803706Z self=, 2025-05-07T20:32:54.2803779Z T=4096, 2025-05-07T20:32:54.2803856Z D=5120, 2025-05-07T20:32:54.2803935Z scale_ub=None, 2025-05-07T20:32:54.2804013Z contiguous=True, 2025-05-07T20:32:54.2804095Z compiled=False, 2025-05-07T20:32:54.2804163Z ) 2025-05-07T20:32:54.2804375Z self = 2025-05-07T20:32:54.2804540Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2804545Z 2025-05-07T20:32:54.2804621Z @given( 2025-05-07T20:32:54.2804737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2804835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2804947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2805060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2805169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2805246Z ) 2025-05-07T20:32:54.2805485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2805572Z def test_silu_mul_quant( 2025-05-07T20:32:54.2805651Z self, 2025-05-07T20:32:54.2805723Z T: int, 2025-05-07T20:32:54.2805798Z D: int, 2025-05-07T20:32:54.2805894Z scale_ub: Optional[float], 2025-05-07T20:32:54.2805978Z contiguous: bool, 2025-05-07T20:32:54.2806060Z compiled: bool, 2025-05-07T20:32:54.2806138Z ) -> None: 2025-05-07T20:32:54.2806226Z torch.manual_seed(2025) 2025-05-07T20:32:54.2806342Z 2025-05-07T20:32:54.2806506Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2808286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2808329Z 2025-05-07T20:32:54.2808444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2808449Z 2025-05-07T20:32:54.2808547Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2808764Z self=, 2025-05-07T20:32:54.2808838Z T=2048, 2025-05-07T20:32:54.2808909Z D=5120, 2025-05-07T20:32:54.2808990Z scale_ub=None, 2025-05-07T20:32:54.2809111Z contiguous=False, 2025-05-07T20:32:54.2809194Z compiled=False, 2025-05-07T20:32:54.2809264Z ) 2025-05-07T20:32:54.2809474Z self = 2025-05-07T20:32:54.2809640Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2809645Z 2025-05-07T20:32:54.2809724Z @given( 2025-05-07T20:32:54.2809836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2809938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2810046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2810157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2810269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2810339Z ) 2025-05-07T20:32:54.2810581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2810673Z def test_silu_mul_quant( 2025-05-07T20:32:54.2810749Z self, 2025-05-07T20:32:54.2810825Z T: int, 2025-05-07T20:32:54.2810901Z D: int, 2025-05-07T20:32:54.2810994Z scale_ub: Optional[float], 2025-05-07T20:32:54.2811080Z contiguous: bool, 2025-05-07T20:32:54.2811162Z compiled: bool, 2025-05-07T20:32:54.2811236Z ) -> None: 2025-05-07T20:32:54.2811328Z torch.manual_seed(2025) 2025-05-07T20:32:54.2811395Z 2025-05-07T20:32:54.2811559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2813376Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
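The OOM reports also show the pool filling up across examples: free memory falls from 140.44 MiB to 26.44 MiB while PyTorch's allocated memory climbs from 21.50 GiB to 21.73 GiB, so tensors from earlier Hypothesis examples are still resident when the next one starts. A per-example cleanup, sketched as a hypothetical teardown that the test file does not currently have:

    import gc

    import torch

    # Drop dangling Python references first, then return cached blocks to
    # the driver so the next example starts from a (mostly) empty pool.
    gc.collect()
    torch.cuda.empty_cache()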
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2813385Z 2025-05-07T20:32:54.2813495Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2813500Z 2025-05-07T20:32:54.2813600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2813814Z self=, 2025-05-07T20:32:54.2813890Z T=4096, 2025-05-07T20:32:54.2813965Z D=7168, 2025-05-07T20:32:54.2814044Z scale_ub=None, 2025-05-07T20:32:54.2814123Z contiguous=True, 2025-05-07T20:32:54.2814203Z compiled=True, 2025-05-07T20:32:54.2814273Z ) 2025-05-07T20:32:54.2814487Z self = 2025-05-07T20:32:54.2814648Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2814701Z 2025-05-07T20:32:54.2814775Z @given( 2025-05-07T20:32:54.2814951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2815049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2815158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2815270Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2815377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2815453Z ) 2025-05-07T20:32:54.2815692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2815823Z def test_silu_mul_quant( 2025-05-07T20:32:54.2815900Z self, 2025-05-07T20:32:54.2815975Z T: int, 2025-05-07T20:32:54.2816048Z D: int, 2025-05-07T20:32:54.2816147Z scale_ub: Optional[float], 2025-05-07T20:32:54.2816232Z contiguous: bool, 2025-05-07T20:32:54.2816313Z compiled: bool, 2025-05-07T20:32:54.2816390Z ) -> None: 2025-05-07T20:32:54.2816480Z torch.manual_seed(2025) 2025-05-07T20:32:54.2816546Z 2025-05-07T20:32:54.2816755Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2818498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2818507Z 2025-05-07T20:32:54.2818623Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2818628Z 2025-05-07T20:32:54.2818726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2818944Z self=, 2025-05-07T20:32:54.2819019Z T=2048, 2025-05-07T20:32:54.2819095Z D=5120, 2025-05-07T20:32:54.2819176Z scale_ub=1200.0, 2025-05-07T20:32:54.2819257Z contiguous=False, 2025-05-07T20:32:54.2819337Z compiled=False, 2025-05-07T20:32:54.2819407Z ) 2025-05-07T20:32:54.2819618Z self = 2025-05-07T20:32:54.2819785Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2819793Z 2025-05-07T20:32:54.2819872Z @given( 2025-05-07T20:32:54.2819985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2820083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2820191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2820301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2820413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2820488Z ) 2025-05-07T20:32:54.2820726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2820821Z def test_silu_mul_quant( 2025-05-07T20:32:54.2820892Z self, 2025-05-07T20:32:54.2820965Z T: int, 2025-05-07T20:32:54.2821042Z D: int, 2025-05-07T20:32:54.2821136Z scale_ub: Optional[float], 2025-05-07T20:32:54.2821224Z contiguous: bool, 2025-05-07T20:32:54.2821305Z compiled: bool, 2025-05-07T20:32:54.2821377Z ) -> None: 2025-05-07T20:32:54.2821474Z torch.manual_seed(2025) 2025-05-07T20:32:54.2821545Z 2025-05-07T20:32:54.2821707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2823489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2823530Z 2025-05-07T20:32:54.2823643Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2823647Z 2025-05-07T20:32:54.2823746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2823960Z self=, 2025-05-07T20:32:54.2824071Z T=4096, 2025-05-07T20:32:54.2824149Z D=7168, 2025-05-07T20:32:54.2824227Z scale_ub=1200.0, 2025-05-07T20:32:54.2824305Z contiguous=True, 2025-05-07T20:32:54.2824385Z compiled=False, 2025-05-07T20:32:54.2824454Z ) 2025-05-07T20:32:54.2824665Z self = 2025-05-07T20:32:54.2824833Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2824838Z 2025-05-07T20:32:54.2824914Z @given( 2025-05-07T20:32:54.2825068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2825165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2825273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2825387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2825495Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2825569Z ) 2025-05-07T20:32:54.2825809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2825896Z def test_silu_mul_quant( 2025-05-07T20:32:54.2825973Z self, 2025-05-07T20:32:54.2826047Z T: int, 2025-05-07T20:32:54.2826118Z D: int, 2025-05-07T20:32:54.2826214Z scale_ub: Optional[float], 2025-05-07T20:32:54.2826299Z contiguous: bool, 2025-05-07T20:32:54.2826382Z compiled: bool, 2025-05-07T20:32:54.2826460Z ) -> None: 2025-05-07T20:32:54.2826550Z torch.manual_seed(2025) 2025-05-07T20:32:54.2826621Z 2025-05-07T20:32:54.2826786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2828520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2828529Z 2025-05-07T20:32:54.2828646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2828654Z 2025-05-07T20:32:54.2828752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2828970Z self=, 2025-05-07T20:32:54.2829043Z T=16384, 2025-05-07T20:32:54.2829114Z D=7168, 2025-05-07T20:32:54.2829198Z scale_ub=None, 2025-05-07T20:32:54.2829278Z contiguous=False, 2025-05-07T20:32:54.2829354Z compiled=True, 2025-05-07T20:32:54.2829423Z ) 2025-05-07T20:32:54.2829631Z self = 2025-05-07T20:32:54.2829800Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2829806Z 2025-05-07T20:32:54.2829885Z @given( 2025-05-07T20:32:54.2829998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2830094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2830203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2830313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2830472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2830543Z ) 2025-05-07T20:32:54.2830823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2830918Z def test_silu_mul_quant( 2025-05-07T20:32:54.2830990Z self, 2025-05-07T20:32:54.2831061Z T: int, 2025-05-07T20:32:54.2831136Z D: int, 2025-05-07T20:32:54.2831232Z scale_ub: Optional[float], 2025-05-07T20:32:54.2831320Z contiguous: bool, 2025-05-07T20:32:54.2831400Z compiled: bool, 2025-05-07T20:32:54.2831515Z ) -> None: 2025-05-07T20:32:54.2831607Z torch.manual_seed(2025) 2025-05-07T20:32:54.2831674Z 2025-05-07T20:32:54.2831835Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2833612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
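[annotation] The allocator hint repeated in each message can be acted on through the environment; a minimal sketch, assuming it is applied before the first CUDA allocation in the process (in practice it would belong in the CI job's environment rather than mid-test):

import os

# Must take effect before CUDA is initialized in this process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
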
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2833622Z 2025-05-07T20:32:54.2833732Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2833737Z 2025-05-07T20:32:54.2833836Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2834052Z self=, 2025-05-07T20:32:54.2834122Z T=4096, 2025-05-07T20:32:54.2834198Z D=7168, 2025-05-07T20:32:54.2834277Z scale_ub=None, 2025-05-07T20:32:54.2834354Z contiguous=True, 2025-05-07T20:32:54.2834439Z compiled=False, 2025-05-07T20:32:54.2834509Z ) 2025-05-07T20:32:54.2834721Z self = 2025-05-07T20:32:54.2834890Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2834897Z 2025-05-07T20:32:54.2834969Z @given( 2025-05-07T20:32:54.2835089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2835185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2835293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2835406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2835514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2835592Z ) 2025-05-07T20:32:54.2835829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2835918Z def test_silu_mul_quant( 2025-05-07T20:32:54.2835993Z self, 2025-05-07T20:32:54.2836066Z T: int, 2025-05-07T20:32:54.2836138Z D: int, 2025-05-07T20:32:54.2836238Z scale_ub: Optional[float], 2025-05-07T20:32:54.2836324Z contiguous: bool, 2025-05-07T20:32:54.2836405Z compiled: bool, 2025-05-07T20:32:54.2836486Z ) -> None: 2025-05-07T20:32:54.2836579Z torch.manual_seed(2025) 2025-05-07T20:32:54.2836645Z 2025-05-07T20:32:54.2836809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2838538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2838547Z 2025-05-07T20:32:54.2838707Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2838711Z 2025-05-07T20:32:54.2838809Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2839068Z self=, 2025-05-07T20:32:54.2839142Z T=16384, 2025-05-07T20:32:54.2839220Z D=7168, 2025-05-07T20:32:54.2839298Z scale_ub=None, 2025-05-07T20:32:54.2839375Z contiguous=True, 2025-05-07T20:32:54.2839457Z compiled=False, 2025-05-07T20:32:54.2839524Z ) 2025-05-07T20:32:54.2839734Z self = 2025-05-07T20:32:54.2839946Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2839950Z 2025-05-07T20:32:54.2840021Z @given( 2025-05-07T20:32:54.2840133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2840231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2840339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2840456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2840565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2840672Z ) 2025-05-07T20:32:54.2840914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2841002Z def test_silu_mul_quant( 2025-05-07T20:32:54.2841074Z self, 2025-05-07T20:32:54.2841146Z T: int, 2025-05-07T20:32:54.2841218Z D: int, 2025-05-07T20:32:54.2841311Z scale_ub: Optional[float], 2025-05-07T20:32:54.2841403Z contiguous: bool, 2025-05-07T20:32:54.2841484Z compiled: bool, 2025-05-07T20:32:54.2841558Z ) -> None: 2025-05-07T20:32:54.2841651Z torch.manual_seed(2025) 2025-05-07T20:32:54.2841718Z 2025-05-07T20:32:54.2841878Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2843621Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
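[annotation] Note the free memory shrinking across examples (26.44 MiB here, down to 4.44 MiB in the later failures) while ~21.7 GiB stays allocated by PyTorch: tensors from earlier Hypothesis examples are evidently still cached. A common mitigation between examples, sketched here as an assumption rather than anything the test currently does:

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dangling references, then return cached allocator blocks.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
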
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2843630Z 2025-05-07T20:32:54.2843744Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2843752Z 2025-05-07T20:32:54.2843848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2844062Z self=, 2025-05-07T20:32:54.2844138Z T=16384, 2025-05-07T20:32:54.2844213Z D=7168, 2025-05-07T20:32:54.2844293Z scale_ub=1200.0, 2025-05-07T20:32:54.2844373Z contiguous=True, 2025-05-07T20:32:54.2844456Z compiled=False, 2025-05-07T20:32:54.2844526Z ) 2025-05-07T20:32:54.2844737Z self = 2025-05-07T20:32:54.2844910Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2844914Z 2025-05-07T20:32:54.2844990Z @given( 2025-05-07T20:32:54.2845103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2845196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2845309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2845423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2845533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2845602Z ) 2025-05-07T20:32:54.2845838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2845927Z def test_silu_mul_quant( 2025-05-07T20:32:54.2846004Z self, 2025-05-07T20:32:54.2846076Z T: int, 2025-05-07T20:32:54.2846221Z D: int, 2025-05-07T20:32:54.2846314Z scale_ub: Optional[float], 2025-05-07T20:32:54.2846398Z contiguous: bool, 2025-05-07T20:32:54.2846521Z compiled: bool, 2025-05-07T20:32:54.2846597Z ) -> None: 2025-05-07T20:32:54.2846686Z torch.manual_seed(2025) 2025-05-07T20:32:54.2846760Z 2025-05-07T20:32:54.2846924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2848658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2848709Z 2025-05-07T20:32:54.2848821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2848828Z 2025-05-07T20:32:54.2848965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2849186Z self=, 2025-05-07T20:32:54.2849261Z T=128, 2025-05-07T20:32:54.2849333Z D=5120, 2025-05-07T20:32:54.2849414Z scale_ub=1200.0, 2025-05-07T20:32:54.2849496Z contiguous=False, 2025-05-07T20:32:54.2849574Z compiled=False, 2025-05-07T20:32:54.2849650Z ) 2025-05-07T20:32:54.2849859Z self = 2025-05-07T20:32:54.2850025Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2850029Z 2025-05-07T20:32:54.2850100Z @given( 2025-05-07T20:32:54.2850212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2850308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2850422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2850535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2850649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2850718Z ) 2025-05-07T20:32:54.2850955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2851046Z def test_silu_mul_quant( 2025-05-07T20:32:54.2851117Z self, 2025-05-07T20:32:54.2851192Z T: int, 2025-05-07T20:32:54.2851261Z D: int, 2025-05-07T20:32:54.2851357Z scale_ub: Optional[float], 2025-05-07T20:32:54.2851445Z contiguous: bool, 2025-05-07T20:32:54.2851526Z compiled: bool, 2025-05-07T20:32:54.2851599Z ) -> None: 2025-05-07T20:32:54.2851693Z torch.manual_seed(2025) 2025-05-07T20:32:54.2851762Z 2025-05-07T20:32:54.2851922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2851996Z 2025-05-07T20:32:54.2852085Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2852208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2852299Z x = x_sign * x_clamp 2025-05-07T20:32:54.2852378Z x0 = x[:, :D] 2025-05-07T20:32:54.2852462Z x1 = x[:, D:] 2025-05-07T20:32:54.2852529Z 2025-05-07T20:32:54.2852606Z if contiguous: 2025-05-07T20:32:54.2852695Z x0 = x0.contiguous() 2025-05-07T20:32:54.2852780Z x1 = x1.contiguous() 2025-05-07T20:32:54.2852845Z 2025-05-07T20:32:54.2852936Z if scale_ub is not None: 2025-05-07T20:32:54.2853129Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2853260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2853334Z ) 2025-05-07T20:32:54.2853406Z else: 2025-05-07T20:32:54.2853496Z scale_ub_tensor = None 2025-05-07T20:32:54.2853564Z 2025-05-07T20:32:54.2853690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2853826Z op = silu_mul_quant 2025-05-07T20:32:54.2853911Z if compiled: 2025-05-07T20:32:54.2854048Z op = torch.compile(op) 2025-05-07T20:32:54.2854154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2854222Z 2025-05-07T20:32:54.2854308Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2854312Z 2025-05-07T20:32:54.2854408Z moe/activation_test.py:117: 2025-05-07T20:32:54.2854532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2854668Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2854767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2855259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2855358Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2855709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2855930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2856316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2856407Z kernel = self.compile( 2025-05-07T20:32:54.2856780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2856952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2857079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2857084Z 2025-05-07T20:32:54.2857288Z self = 2025-05-07T20:32:54.2858051Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2858552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e8247c0>} 2025-05-07T20:32:54.2859839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2860033Z context = 2025-05-07T20:32:54.2860042Z 2025-05-07T20:32:54.2860205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2860458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2860563Z module_map=module_map) 2025-05-07T20:32:54.2860724Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2860818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2860892Z E ^ 2025-05-07T20:32:54.2861243Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2861248Z 2025-05-07T20:32:54.2861652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2861656Z 2025-05-07T20:32:54.2861760Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2861978Z self=, 2025-05-07T20:32:54.2862055Z T=2048, 2025-05-07T20:32:54.2862128Z D=7168, 2025-05-07T20:32:54.2862205Z scale_ub=None, 2025-05-07T20:32:54.2862288Z contiguous=False, 2025-05-07T20:32:54.2862367Z compiled=False, 2025-05-07T20:32:54.2862440Z ) 2025-05-07T20:32:54.2862655Z self = 2025-05-07T20:32:54.2862903Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2862964Z 2025-05-07T20:32:54.2863038Z @given( 2025-05-07T20:32:54.2863157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2863254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2863369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2863481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863724Z ) 2025-05-07T20:32:54.2863963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2864053Z def test_silu_mul_quant( 2025-05-07T20:32:54.2864126Z self, 2025-05-07T20:32:54.2864198Z T: int, 2025-05-07T20:32:54.2864271Z D: int, 2025-05-07T20:32:54.2864368Z scale_ub: Optional[float], 2025-05-07T20:32:54.2864456Z contiguous: bool, 2025-05-07T20:32:54.2864537Z compiled: bool, 2025-05-07T20:32:54.2864611Z ) -> None: 2025-05-07T20:32:54.2864762Z torch.manual_seed(2025) 2025-05-07T20:32:54.2864837Z 2025-05-07T20:32:54.2865001Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2866744Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
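[annotation] The fp8e4nv CompilationError above, unlike the OOMs, is deterministic: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which the NVIDIA backend only lowers on compute capability 8.9+ (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G at 8.6, hence only 'fp8e4b15' and 'fp8e5' are offered. A capability gate of roughly this shape would skip rather than fail (a sketch; the helper and skip condition are assumptions, not FBGEMM's existing logic):

import unittest
import torch

def cuda_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(
    cuda_supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) needs SM 8.9+"
)
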
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2866756Z 2025-05-07T20:32:54.2866874Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2866881Z 2025-05-07T20:32:54.2866979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2867202Z self=, 2025-05-07T20:32:54.2867273Z T=128, 2025-05-07T20:32:54.2867343Z D=7168, 2025-05-07T20:32:54.2867422Z scale_ub=1200.0, 2025-05-07T20:32:54.2867503Z contiguous=True, 2025-05-07T20:32:54.2867584Z compiled=True, 2025-05-07T20:32:54.2867653Z ) 2025-05-07T20:32:54.2867862Z self = 2025-05-07T20:32:54.2868028Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2868032Z 2025-05-07T20:32:54.2868105Z @given( 2025-05-07T20:32:54.2868219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2868318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2868427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2868540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2868650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2868723Z ) 2025-05-07T20:32:54.2868968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2869057Z def test_silu_mul_quant( 2025-05-07T20:32:54.2869127Z self, 2025-05-07T20:32:54.2869201Z T: int, 2025-05-07T20:32:54.2869274Z D: int, 2025-05-07T20:32:54.2869367Z scale_ub: Optional[float], 2025-05-07T20:32:54.2869455Z contiguous: bool, 2025-05-07T20:32:54.2869539Z compiled: bool, 2025-05-07T20:32:54.2869617Z ) -> None: 2025-05-07T20:32:54.2869712Z torch.manual_seed(2025) 2025-05-07T20:32:54.2869781Z 2025-05-07T20:32:54.2869942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2870013Z 2025-05-07T20:32:54.2870102Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2870224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2870359Z x = x_sign * x_clamp 2025-05-07T20:32:54.2870434Z x0 = x[:, :D] 2025-05-07T20:32:54.2870551Z x1 = x[:, D:] 2025-05-07T20:32:54.2870619Z 2025-05-07T20:32:54.2870700Z if contiguous: 2025-05-07T20:32:54.2870790Z x0 = x0.contiguous() 2025-05-07T20:32:54.2870874Z x1 = x1.contiguous() 2025-05-07T20:32:54.2870944Z 2025-05-07T20:32:54.2871034Z if scale_ub is not None: 2025-05-07T20:32:54.2871136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2871307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2871379Z ) 2025-05-07T20:32:54.2871451Z else: 2025-05-07T20:32:54.2871542Z scale_ub_tensor = None 2025-05-07T20:32:54.2871612Z 2025-05-07T20:32:54.2871737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2871827Z op = silu_mul_quant 2025-05-07T20:32:54.2871909Z if compiled: 2025-05-07T20:32:54.2872004Z op = torch.compile(op) 2025-05-07T20:32:54.2872115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2872250Z 2025-05-07T20:32:54.2872339Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2872343Z 2025-05-07T20:32:54.2872438Z moe/activation_test.py:117: 2025-05-07T20:32:54.2872562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2872656Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2872758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2873123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2873214Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2873696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2873789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2874140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2874361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2874691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2874786Z kernel = self.compile( 2025-05-07T20:32:54.2875156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2875329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2875453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2875458Z 2025-05-07T20:32:54.2875658Z self = 2025-05-07T20:32:54.2876422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2876919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e825940>} 2025-05-07T20:32:54.2877651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2877839Z context = 2025-05-07T20:32:54.2877843Z 2025-05-07T20:32:54.2878004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2878260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2878414Z module_map=module_map) 2025-05-07T20:32:54.2878577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2878712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2878788Z E ^ 2025-05-07T20:32:54.2879137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2879142Z 2025-05-07T20:32:54.2879544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2879548Z 2025-05-07T20:32:54.2879686Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2879903Z self=, 2025-05-07T20:32:54.2879976Z T=128, 2025-05-07T20:32:54.2880051Z D=7168, 2025-05-07T20:32:54.2880131Z scale_ub=1200.0, 2025-05-07T20:32:54.2880211Z contiguous=True, 2025-05-07T20:32:54.2880293Z compiled=False, 2025-05-07T20:32:54.2880366Z ) 2025-05-07T20:32:54.2880579Z self = 2025-05-07T20:32:54.2880785Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2880791Z 2025-05-07T20:32:54.2880868Z @given( 2025-05-07T20:32:54.2880985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2881079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2881190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2881304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2881414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2881484Z ) 2025-05-07T20:32:54.2881725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2881814Z def test_silu_mul_quant( 2025-05-07T20:32:54.2881888Z self, 2025-05-07T20:32:54.2881961Z T: int, 2025-05-07T20:32:54.2882034Z D: int, 2025-05-07T20:32:54.2882132Z scale_ub: Optional[float], 2025-05-07T20:32:54.2882219Z contiguous: bool, 2025-05-07T20:32:54.2882302Z compiled: bool, 2025-05-07T20:32:54.2882384Z ) -> None: 2025-05-07T20:32:54.2882477Z torch.manual_seed(2025) 2025-05-07T20:32:54.2882548Z 2025-05-07T20:32:54.2882715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2882786Z 2025-05-07T20:32:54.2882879Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2883002Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2884746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
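[annotation] The "Trying example:" lines throughout come from Verbosity.verbose, and the deterministic example order comes from the 'ci' profile reported in the rerun banner below (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). Reconstructed from that banner, the registration would look roughly like this (the registration site is an assumption):

from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")
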
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2884759Z 2025-05-07T20:32:54.2884876Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2884881Z 2025-05-07T20:32:54.2884979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2885197Z self=, 2025-05-07T20:32:54.2885268Z T=128, 2025-05-07T20:32:54.2885340Z D=5120, 2025-05-07T20:32:54.2885421Z scale_ub=1200.0, 2025-05-07T20:32:54.2885504Z contiguous=True, 2025-05-07T20:32:54.2885581Z compiled=True, 2025-05-07T20:32:54.2885656Z ) 2025-05-07T20:32:54.2885866Z self = 2025-05-07T20:32:54.2886026Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2886031Z 2025-05-07T20:32:54.2886108Z @given( 2025-05-07T20:32:54.2886268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2886368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2886518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2886630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2886744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2886814Z ) 2025-05-07T20:32:54.2887054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2887145Z def test_silu_mul_quant( 2025-05-07T20:32:54.2887257Z self, 2025-05-07T20:32:54.2887329Z T: int, 2025-05-07T20:32:54.2887405Z D: int, 2025-05-07T20:32:54.2887499Z scale_ub: Optional[float], 2025-05-07T20:32:54.2887584Z contiguous: bool, 2025-05-07T20:32:54.2887669Z compiled: bool, 2025-05-07T20:32:54.2887739Z ) -> None: 2025-05-07T20:32:54.2887841Z torch.manual_seed(2025) 2025-05-07T20:32:54.2887912Z 2025-05-07T20:32:54.2888073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2888143Z 2025-05-07T20:32:54.2888273Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2888397Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2890134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2890142Z 2025-05-07T20:32:54.2890254Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2890260Z 2025-05-07T20:32:54.2890358Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2890574Z self=, 2025-05-07T20:32:54.2890651Z T=128, 2025-05-07T20:32:54.2890731Z D=7168, 2025-05-07T20:32:54.2890806Z scale_ub=None, 2025-05-07T20:32:54.2890889Z contiguous=True, 2025-05-07T20:32:54.2890968Z compiled=True, 2025-05-07T20:32:54.2891036Z ) 2025-05-07T20:32:54.2891247Z self = 2025-05-07T20:32:54.2891406Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2891413Z 2025-05-07T20:32:54.2891486Z @given( 2025-05-07T20:32:54.2891603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2891697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2891807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2891923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2892032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2892107Z ) 2025-05-07T20:32:54.2892349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2892438Z def test_silu_mul_quant( 2025-05-07T20:32:54.2892511Z self, 2025-05-07T20:32:54.2892584Z T: int, 2025-05-07T20:32:54.2892655Z D: int, 2025-05-07T20:32:54.2892752Z scale_ub: Optional[float], 2025-05-07T20:32:54.2892836Z contiguous: bool, 2025-05-07T20:32:54.2892917Z compiled: bool, 2025-05-07T20:32:54.2893070Z ) -> None: 2025-05-07T20:32:54.2893162Z torch.manual_seed(2025) 2025-05-07T20:32:54.2893233Z 2025-05-07T20:32:54.2893400Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2895178Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2895226Z 2025-05-07T20:32:54.2895338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2895467Z =============================== warnings summary =============================== 2025-05-07T20:32:54.2895809Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2896103Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2896391Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2897291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:54.2897517Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:54.2897522Z 2025-05-07T20:32:54.2897732Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.2897896Z ================= 1 failed, 1 deselected, 3 warnings in 13.10s ================= 2025-05-07T20:32:55.7921630Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:55.8544768Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:55.8545082Z 2025-05-07T20:32:57.8561593Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:00.0089933Z ============================= test session starts ============================== 2025-05-07T20:33:00.0091306Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:00.0092233Z cachedir: .pytest_cache 2025-05-07T20:33:00.0093394Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:00.0100385Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:00.0100839Z plugins: hypothesis-6.131.14 2025-05-07T20:33:01.6152995Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:01.7240757Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:01.7241305Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:01.7241603Z 2025-05-07T20:33:04.0718497Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0720313Z self=, 2025-05-07T20:33:04.0721176Z T=1, 2025-05-07T20:33:04.0721564Z D=5120, 2025-05-07T20:33:04.0721968Z scale_ub=None, 2025-05-07T20:33:04.0722415Z contiguous=True, 2025-05-07T20:33:04.0722865Z compiled=True, 2025-05-07T20:33:04.0723306Z ) 2025-05-07T20:33:04.0723956Z self = 2025-05-07T20:33:04.0724931Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.0725451Z 2025-05-07T20:33:04.0725613Z @given( 2025-05-07T20:33:04.0726083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0726684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0727425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0727856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0728194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0728477Z ) 2025-05-07T20:33:04.0728829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0729276Z def test_silu_mul_quant( 2025-05-07T20:33:04.0729525Z self, 2025-05-07T20:33:04.0729719Z T: int, 2025-05-07T20:33:04.0729927Z D: int, 2025-05-07T20:33:04.0730250Z scale_ub: Optional[float], 2025-05-07T20:33:04.0730518Z contiguous: bool, 2025-05-07T20:33:04.0730765Z compiled: bool, 2025-05-07T20:33:04.0730996Z ) -> None: 2025-05-07T20:33:04.0731211Z torch.manual_seed(2025) 2025-05-07T20:33:04.0731458Z 2025-05-07T20:33:04.0731739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0732079Z 2025-05-07T20:33:04.0732277Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0732572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:04.0733094Z x = x_sign * x_clamp 2025-05-07T20:33:04.0733343Z x0 = x[:, :D] 2025-05-07T20:33:04.0733567Z x1 = x[:, D:] 2025-05-07T20:33:04.0733775Z 2025-05-07T20:33:04.0733966Z if contiguous: 2025-05-07T20:33:04.0734207Z x0 = x0.contiguous() 2025-05-07T20:33:04.0734462Z x1 = x1.contiguous() 2025-05-07T20:33:04.0734712Z 2025-05-07T20:33:04.0734918Z if scale_ub is not None: 2025-05-07T20:33:04.0735195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0735535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0735849Z ) 2025-05-07T20:33:04.0736051Z else: 2025-05-07T20:33:04.0736261Z scale_ub_tensor = None 2025-05-07T20:33:04.0736523Z 2025-05-07T20:33:04.0736764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0737083Z op = silu_mul_quant 2025-05-07T20:33:04.0737344Z if compiled: 2025-05-07T20:33:04.0737603Z op = torch.compile(op) 2025-05-07T20:33:04.0737900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0738183Z 2025-05-07T20:33:04.0738389Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.0738672Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.0738974Z 2025-05-07T20:33:04.0739221Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0739555Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.0739858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.0740177Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.0740538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.0740848Z 2025-05-07T20:33:04.0741057Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.0741255Z 2025-05-07T20:33:04.0741366Z moe/activation_test.py:126: 2025-05-07T20:33:04.0741667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0742007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.0742336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.0743127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:04.0743874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.0744426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0745105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0745784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.0746569Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.0747348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.0747985Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.0748578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.0749095Z fn() 2025-05-07T20:33:04.0749603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.0750232Z self.fn.run( 2025-05-07T20:33:04.0750701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0751231Z kernel = self.compile( 2025-05-07T20:33:04.0751770Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0752417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0752868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0753095Z 2025-05-07T20:33:04.0753310Z self = 2025-05-07T20:33:04.0754394Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0755771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057db01c60>} 2025-05-07T20:33:04.0757098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0758121Z context = 2025-05-07T20:33:04.0758412Z 2025-05-07T20:33:04.0758591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0759101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0760096Z module_map=module_map) 2025-05-07T20:33:04.0760469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0760956Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.0761250Z E ^ 2025-05-07T20:33:04.0761728Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0762182Z 2025-05-07T20:33:04.0762607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0763120Z 2025-05-07T20:33:04.0763237Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0763659Z self=, 2025-05-07T20:33:04.0764075Z T=2048, 2025-05-07T20:33:04.0764286Z D=5120, 2025-05-07T20:33:04.0764486Z scale_ub=1200.0, 2025-05-07T20:33:04.0764724Z contiguous=True, 2025-05-07T20:33:04.0764964Z compiled=False, 2025-05-07T20:33:04.0765178Z ) 2025-05-07T20:33:04.8084334Z self = 2025-05-07T20:33:04.8085171Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.8085568Z 2025-05-07T20:33:04.8085681Z @given( 2025-05-07T20:33:04.8085995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8086326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8086635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8086980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8087647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8087935Z ) 2025-05-07T20:33:04.8088383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8088839Z def test_silu_mul_quant( 2025-05-07T20:33:04.8089089Z self, 2025-05-07T20:33:04.8089288Z T: int, 2025-05-07T20:33:04.8089492Z D: int, 2025-05-07T20:33:04.8089717Z scale_ub: Optional[float], 2025-05-07T20:33:04.8089987Z contiguous: bool, 2025-05-07T20:33:04.8090230Z compiled: bool, 2025-05-07T20:33:04.8090540Z ) -> None: 2025-05-07T20:33:04.8090755Z torch.manual_seed(2025) 2025-05-07T20:33:04.8091006Z 2025-05-07T20:33:04.8091283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8091623Z 2025-05-07T20:33:04.8091827Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8092126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8092432Z x = x_sign * x_clamp 2025-05-07T20:33:04.8092684Z x0 = x[:, :D] 
2025-05-07T20:33:04.8092912Z x1 = x[:, D:] 2025-05-07T20:33:04.8093319Z 2025-05-07T20:33:04.8093519Z if contiguous: 2025-05-07T20:33:04.8093760Z x0 = x0.contiguous() 2025-05-07T20:33:04.8094016Z x1 = x1.contiguous() 2025-05-07T20:33:04.8094267Z 2025-05-07T20:33:04.8094474Z if scale_ub is not None: 2025-05-07T20:33:04.8094757Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8095091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8095415Z ) 2025-05-07T20:33:04.8095616Z else: 2025-05-07T20:33:04.8095828Z scale_ub_tensor = None 2025-05-07T20:33:04.8096089Z 2025-05-07T20:33:04.8096337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8096657Z op = silu_mul_quant 2025-05-07T20:33:04.8096912Z if compiled: 2025-05-07T20:33:04.8097162Z op = torch.compile(op) 2025-05-07T20:33:04.8097461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8097744Z 2025-05-07T20:33:04.8097939Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8098108Z 2025-05-07T20:33:04.8098210Z moe/activation_test.py:117: 2025-05-07T20:33:04.8098513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8098846Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8099130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8099820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8100508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8101038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8101716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8102375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8102909Z kernel = self.compile( 2025-05-07T20:33:04.8103444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8104091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8104488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8104719Z 2025-05-07T20:33:04.8104930Z self = 2025-05-07T20:33:04.8105995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8107450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d958220>} 2025-05-07T20:33:04.8108816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8109828Z context = 2025-05-07T20:33:04.8110110Z 2025-05-07T20:33:04.8110275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8110832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8111298Z module_map=module_map) 2025-05-07T20:33:04.8111664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8112014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8112281Z E ^ 2025-05-07T20:33:04.8112760Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8113213Z 2025-05-07T20:33:04.8113668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8114180Z 2025-05-07T20:33:04.8114291Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8114710Z self=, 2025-05-07T20:33:04.8115119Z T=2048, 2025-05-07T20:33:04.8115317Z D=5120, 2025-05-07T20:33:04.8115517Z scale_ub=1200.0, 2025-05-07T20:33:04.8115747Z contiguous=True, 2025-05-07T20:33:04.8115970Z compiled=True, 2025-05-07T20:33:04.8116183Z ) 2025-05-07T20:33:04.8116511Z self = 2025-05-07T20:33:04.8116999Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.8117278Z 2025-05-07T20:33:04.8117355Z @given( 2025-05-07T20:33:04.8117589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8117902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8118214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8118547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8118879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8119159Z ) 2025-05-07T20:33:04.8119510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8119952Z def test_silu_mul_quant( 2025-05-07T20:33:04.8120192Z self, 2025-05-07T20:33:04.8120392Z T: int, 2025-05-07T20:33:04.8120593Z D: int, 2025-05-07T20:33:04.8120808Z scale_ub: Optional[float], 2025-05-07T20:33:04.8121079Z contiguous: bool, 2025-05-07T20:33:04.8121321Z compiled: bool, 2025-05-07T20:33:04.8121543Z ) -> None: 2025-05-07T20:33:04.8121762Z torch.manual_seed(2025) 2025-05-07T20:33:04.8122010Z 2025-05-07T20:33:04.8122280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8122628Z 2025-05-07T20:33:04.8122829Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8123116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8123433Z x = x_sign * x_clamp 2025-05-07T20:33:04.8123685Z x0 = x[:, :D] 2025-05-07T20:33:04.8123899Z x1 = x[:, D:] 2025-05-07T20:33:04.8124118Z 2025-05-07T20:33:04.8124308Z if contiguous: 2025-05-07T20:33:04.8124550Z x0 = x0.contiguous() 2025-05-07T20:33:04.8124806Z x1 = x1.contiguous() 2025-05-07T20:33:04.8125052Z 2025-05-07T20:33:04.8125252Z if scale_ub is not None: 2025-05-07T20:33:04.8125524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8125871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8126248Z ) 2025-05-07T20:33:04.8126438Z else: 2025-05-07T20:33:04.8126653Z scale_ub_tensor = None 2025-05-07T20:33:04.8126905Z 2025-05-07T20:33:04.8127186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8127513Z op = silu_mul_quant 2025-05-07T20:33:04.8127770Z if compiled: 2025-05-07T20:33:04.8128020Z op = torch.compile(op) 2025-05-07T20:33:04.8128321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8128600Z 2025-05-07T20:33:04.8128794Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.8129130Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.8129427Z 2025-05-07T20:33:04.8129669Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8129999Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.8130293Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.8130612Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.8130966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8131278Z 2025-05-07T20:33:04.8131532Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.8131730Z 2025-05-07T20:33:04.8131829Z moe/activation_test.py:126: 2025-05-07T20:33:04.8132125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8132462Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.8132791Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8133633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:04.8134385Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.8134926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8135592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8136280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.8137000Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.8137722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.8138348Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.8138945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.8139457Z fn() 2025-05-07T20:33:04.8139960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.8140528Z self.fn.run( 2025-05-07T20:33:04.8140998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8141528Z kernel = self.compile( 2025-05-07T20:33:04.8142063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8142705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8143097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8143322Z 2025-05-07T20:33:04.8143530Z self = 2025-05-07T20:33:04.8144585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8145942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d9596c0>} 2025-05-07T20:33:04.8147403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8148412Z context = 2025-05-07T20:33:04.8148700Z 2025-05-07T20:33:04.8148874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8149384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8149885Z module_map=module_map) 2025-05-07T20:33:04.8150254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8150610Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.8150882Z E ^ 2025-05-07T20:33:04.8151348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8151795Z 2025-05-07T20:33:04.8152213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8152758Z 2025-05-07T20:33:04.8152869Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8153283Z self=, 2025-05-07T20:33:04.8153686Z T=16384, 2025-05-07T20:33:04.8153878Z D=7168, 2025-05-07T20:33:04.8154083Z scale_ub=1200.0, 2025-05-07T20:33:04.8154318Z contiguous=False, 2025-05-07T20:33:04.8154546Z compiled=False, 2025-05-07T20:33:04.8154762Z ) 2025-05-07T20:33:05.5408614Z self = 2025-05-07T20:33:05.5409413Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.5409802Z 2025-05-07T20:33:05.5409925Z @given( 2025-05-07T20:33:05.5410240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5410690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5411119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5411486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5411813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5412108Z ) 2025-05-07T20:33:05.5412467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5412908Z def test_silu_mul_quant( 2025-05-07T20:33:05.5413248Z self, 2025-05-07T20:33:05.5413452Z T: int, 2025-05-07T20:33:05.5413655Z D: int, 2025-05-07T20:33:05.5413882Z scale_ub: Optional[float], 2025-05-07T20:33:05.5414163Z contiguous: bool, 2025-05-07T20:33:05.5414408Z compiled: bool, 2025-05-07T20:33:05.5414651Z ) -> None: 2025-05-07T20:33:05.5414877Z torch.manual_seed(2025) 2025-05-07T20:33:05.5415128Z 2025-05-07T20:33:05.5415411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5415766Z 2025-05-07T20:33:05.5415970Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5416267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5416589Z x = x_sign * x_clamp 2025-05-07T20:33:05.5416842Z x0 = x[:, :D] 2025-05-07T20:33:05.5417059Z x1 = x[:, D:] 2025-05-07T20:33:05.5417273Z 2025-05-07T20:33:05.5417469Z if contiguous: 2025-05-07T20:33:05.5417704Z x0 = x0.contiguous() 2025-05-07T20:33:05.5417967Z x1 = x1.contiguous() 2025-05-07T20:33:05.5418213Z 2025-05-07T20:33:05.5418403Z if scale_ub is not None: 2025-05-07T20:33:05.5418680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5419019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5419321Z ) 2025-05-07T20:33:05.5419521Z else: 2025-05-07T20:33:05.5419735Z scale_ub_tensor = None 2025-05-07T20:33:05.5420155Z 2025-05-07T20:33:05.5420389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5420704Z op = silu_mul_quant 2025-05-07T20:33:05.5421041Z if compiled: 2025-05-07T20:33:05.5421292Z op = torch.compile(op) 2025-05-07T20:33:05.5421589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5421866Z 2025-05-07T20:33:05.5422054Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5422222Z 2025-05-07T20:33:05.5422323Z moe/activation_test.py:117: 2025-05-07T20:33:05.5422619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5423026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5423311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5424005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.5424690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5425222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5425975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5426635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5427198Z kernel = self.compile( 2025-05-07T20:33:05.5427758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5428412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5428811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5429037Z 2025-05-07T20:33:05.5429242Z self = 2025-05-07T20:33:05.5430315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5431691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c824720>} 2025-05-07T20:33:05.5433019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5434037Z context = 2025-05-07T20:33:05.5434320Z 2025-05-07T20:33:05.5434488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5435008Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5435478Z module_map=module_map) 2025-05-07T20:33:05.5435844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5436199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5436471Z E ^ 2025-05-07T20:33:05.5436930Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [... test source identical to the example above; this time fn() returns and
     the failure moves to the reference path ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f057c8242c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
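The reference path fails the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, and compilation aborts before any config can be timed. For intuition, this is roughly what ref_fn computes, written as a plain-PyTorch sketch with no Triton; FP8_MAX, the epsilon guard, and the rounding behaviour are assumptions and may differ from FBGEMM's actual row-wise quantization:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, then symmetric per-row quantization to FP8.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid divide-by-zero
        if scale_ub is not None:
            # scale_ub is a 1-element fp32 tensor, as in the test above.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does (y_fp8.to(torch.float32) * scale[:, None]) recovers the fp32 activation up to FP8 rounding error.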
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.5482642Z 2025-05-07T20:33:05.5483065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.5483572Z 2025-05-07T20:33:05.5483688Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5484102Z self=, 2025-05-07T20:33:05.5484512Z T=4096, 2025-05-07T20:33:05.5484711Z D=5120, 2025-05-07T20:33:05.5484906Z scale_ub=None, 2025-05-07T20:33:05.5485133Z contiguous=False, 2025-05-07T20:33:05.5485368Z compiled=False, 2025-05-07T20:33:05.5485576Z ) 2025-05-07T20:33:06.3419518Z self = 2025-05-07T20:33:06.3421011Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.3421925Z 2025-05-07T20:33:06.3422088Z @given( 2025-05-07T20:33:06.3422657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.3423290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.3423897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.3424542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.3425188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.3425755Z ) 2025-05-07T20:33:06.3426434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.3427390Z def test_silu_mul_quant( 2025-05-07T20:33:06.3427637Z self, 2025-05-07T20:33:06.3427838Z T: int, 2025-05-07T20:33:06.3428041Z D: int, 2025-05-07T20:33:06.3428266Z scale_ub: Optional[float], 2025-05-07T20:33:06.3428545Z contiguous: bool, 2025-05-07T20:33:06.3428787Z compiled: bool, 2025-05-07T20:33:06.3429020Z ) -> None: 2025-05-07T20:33:06.3429239Z torch.manual_seed(2025) 2025-05-07T20:33:06.3429481Z 2025-05-07T20:33:06.3429827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.3430175Z 2025-05-07T20:33:06.3430370Z x_sign = torch.sign(x) 2025-05-07T20:33:06.3430664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.3430979Z x = x_sign * x_clamp 2025-05-07T20:33:06.3431219Z x0 = x[:, :D] 2025-05-07T20:33:06.3431445Z x1 = x[:, D:] 2025-05-07T20:33:06.3431663Z 2025-05-07T20:33:06.3431852Z if contiguous: 2025-05-07T20:33:06.3432095Z x0 = x0.contiguous() 2025-05-07T20:33:06.3432358Z x1 = x1.contiguous() 2025-05-07T20:33:06.3432596Z 2025-05-07T20:33:06.3432794Z if scale_ub is not None: 2025-05-07T20:33:06.3433078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.3433416Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.3433723Z ) 2025-05-07T20:33:06.3433923Z else: 2025-05-07T20:33:06.3434140Z scale_ub_tensor = None 2025-05-07T20:33:06.3434392Z 2025-05-07T20:33:06.3434632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.3434952Z op = silu_mul_quant 2025-05-07T20:33:06.3435205Z if compiled: 2025-05-07T20:33:06.3435457Z op = torch.compile(op) 2025-05-07T20:33:06.3435755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3436027Z 2025-05-07T20:33:06.3436226Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.3436392Z 2025-05-07T20:33:06.3436503Z moe/activation_test.py:117: 2025-05-07T20:33:06.3436799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3437137Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.3437446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3438169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.3438851Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.3439389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.3440065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.3440729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.3441254Z kernel = self.compile( 2025-05-07T20:33:06.3441800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.3442450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.3442842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3443071Z 2025-05-07T20:33:06.3443331Z self = 2025-05-07T20:33:06.3444439Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.3445795Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d937240>} 2025-05-07T20:33:06.3447119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.3448217Z context = 2025-05-07T20:33:06.3448505Z 2025-05-07T20:33:06.3448670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.3449185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.3449690Z module_map=module_map) 2025-05-07T20:33:06.3450053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3450403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3450662Z E ^ 2025-05-07T20:33:06.3451118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.3451565Z 2025-05-07T20:33:06.3451977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.3452490Z 2025-05-07T20:33:06.3452594Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.3453095Z self=, 2025-05-07T20:33:06.3453489Z T=4096, 2025-05-07T20:33:06.3453682Z D=7168, 2025-05-07T20:33:06.3453883Z scale_ub=None, 2025-05-07T20:33:06.3454095Z contiguous=False, 2025-05-07T20:33:06.3454327Z compiled=False, 2025-05-07T20:33:06.3454535Z ) 2025-05-07T20:33:06.3454854Z self = 2025-05-07T20:33:06.3455353Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.3455622Z 2025-05-07T20:33:06.3455705Z @given( 2025-05-07T20:33:06.3455932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.3456246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.3456556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.3456876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.3457205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.3457523Z ) 2025-05-07T20:33:06.3457896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.3458331Z def test_silu_mul_quant( 2025-05-07T20:33:06.3458576Z self, 2025-05-07T20:33:06.3458775Z T: int, 2025-05-07T20:33:06.3458969Z D: int, 2025-05-07T20:33:06.3459365Z scale_ub: Optional[float], 2025-05-07T20:33:06.3459646Z contiguous: bool, 2025-05-07T20:33:06.3459882Z compiled: bool, 2025-05-07T20:33:06.3460110Z ) -> None: 2025-05-07T20:33:06.3460326Z torch.manual_seed(2025) 2025-05-07T20:33:06.3460564Z 2025-05-07T20:33:06.3460835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.3461177Z 2025-05-07T20:33:06.3461368Z x_sign = torch.sign(x) 2025-05-07T20:33:06.3461658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.3461965Z x = x_sign * x_clamp 2025-05-07T20:33:06.3462198Z x0 = x[:, :D] 2025-05-07T20:33:06.3462416Z x1 = x[:, D:] 2025-05-07T20:33:06.3462624Z 2025-05-07T20:33:06.3462808Z if contiguous: 2025-05-07T20:33:06.3463041Z x0 = x0.contiguous() 2025-05-07T20:33:06.3463378Z x1 = x1.contiguous() 2025-05-07T20:33:06.3463618Z 2025-05-07T20:33:06.3463868Z if scale_ub is not None: 2025-05-07T20:33:06.3464148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.3464483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.3464790Z ) 2025-05-07T20:33:06.3464985Z else: 2025-05-07T20:33:06.3465196Z scale_ub_tensor = None 2025-05-07T20:33:06.3465443Z 2025-05-07T20:33:06.3465679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.3466052Z op = silu_mul_quant 2025-05-07T20:33:06.3466297Z if compiled: 2025-05-07T20:33:06.3466548Z op = torch.compile(op) 2025-05-07T20:33:06.3466844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3467111Z 2025-05-07T20:33:06.3467305Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.3467469Z 2025-05-07T20:33:06.3467579Z moe/activation_test.py:117: 2025-05-07T20:33:06.3467878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3468262Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.3468545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3469226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.3469901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.3470435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.3471111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.3471771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.3472292Z kernel = self.compile( 2025-05-07T20:33:06.3472830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.3473483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.3473873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3474106Z 2025-05-07T20:33:06.3474312Z self = 2025-05-07T20:33:06.3475379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.3476728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c1918a0>} 2025-05-07T20:33:06.3478048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.3479057Z context = 2025-05-07T20:33:06.3479345Z 2025-05-07T20:33:06.3479510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.3480027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.3480492Z module_map=module_map) 2025-05-07T20:33:06.3480853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3481208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3481464Z E ^ 2025-05-07T20:33:06.3481915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.3482362Z 2025-05-07T20:33:06.3482772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.3483407Z 2025-05-07T20:33:06.3483511Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.3483963Z self=, 2025-05-07T20:33:06.3484357Z T=128, 2025-05-07T20:33:06.3484548Z D=7168, 2025-05-07T20:33:06.3484745Z scale_ub=None, 2025-05-07T20:33:06.3484954Z contiguous=False, 2025-05-07T20:33:06.3485183Z compiled=True, 2025-05-07T20:33:06.3485394Z ) 2025-05-07T20:33:06.4040283Z self = 2025-05-07T20:33:06.4041676Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:06.4042217Z 2025-05-07T20:33:06.4042383Z @given( 2025-05-07T20:33:06.4042855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.4043482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.4044089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.4044752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.4045412Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.4045988Z ) 2025-05-07T20:33:06.4046819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.4047467Z def test_silu_mul_quant( 2025-05-07T20:33:06.4047705Z self, 2025-05-07T20:33:06.4047903Z T: int, 2025-05-07T20:33:06.4048099Z D: int, 2025-05-07T20:33:06.4048314Z scale_ub: Optional[float], 2025-05-07T20:33:06.4048584Z contiguous: bool, 2025-05-07T20:33:06.4048828Z compiled: bool, 2025-05-07T20:33:06.4049050Z ) -> None: 2025-05-07T20:33:06.4049267Z torch.manual_seed(2025) 2025-05-07T20:33:06.4049511Z 2025-05-07T20:33:06.4049785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.4050124Z 2025-05-07T20:33:06.4050325Z x_sign = torch.sign(x) 2025-05-07T20:33:06.4050619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.4050929Z x = x_sign * x_clamp 2025-05-07T20:33:06.4051174Z x0 = x[:, :D] 2025-05-07T20:33:06.4051402Z x1 = x[:, D:] 2025-05-07T20:33:06.4051612Z 2025-05-07T20:33:06.4051805Z if contiguous: 2025-05-07T20:33:06.4052036Z x0 = x0.contiguous() 2025-05-07T20:33:06.4052287Z x1 = x1.contiguous() 2025-05-07T20:33:06.4052528Z 2025-05-07T20:33:06.4052717Z if scale_ub is not None: 2025-05-07T20:33:06.4053075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.4053413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.4053724Z ) 2025-05-07T20:33:06.4053909Z else: 2025-05-07T20:33:06.4054121Z scale_ub_tensor = None 2025-05-07T20:33:06.4054372Z 2025-05-07T20:33:06.4054602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.4054910Z op = silu_mul_quant 2025-05-07T20:33:06.4055163Z if compiled: 2025-05-07T20:33:06.4055412Z op = torch.compile(op) 2025-05-07T20:33:06.4055703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.4055977Z 2025-05-07T20:33:06.4056163Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.4056446Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.4056735Z 2025-05-07T20:33:06.4056973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.4057297Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.4057588Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.4057898Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.4058246Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.4058556Z 2025-05-07T20:33:06.4058758Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:06.4058948Z 2025-05-07T20:33:06.4059052Z moe/activation_test.py:126: 2025-05-07T20:33:06.4059569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.4059900Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.4060320Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.4061094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.4061835Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.4062375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.4063105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.4063771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.4064478Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.4065200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.4065887Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.4066477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.4066984Z fn() 2025-05-07T20:33:06.4067514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.4068104Z self.fn.run( 2025-05-07T20:33:06.4068570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.4069091Z kernel = self.compile( 2025-05-07T20:33:06.4069627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.4070265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.4070658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.4070882Z 2025-05-07T20:33:06.4071099Z self = 2025-05-07T20:33:06.4072160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.4073497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c0f5a80>} 2025-05-07T20:33:06.4074810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.4075811Z context = 2025-05-07T20:33:06.4076095Z 2025-05-07T20:33:06.4076266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.4076771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.4077231Z module_map=module_map) 2025-05-07T20:33:06.4077595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.4077996Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.4078255Z E ^ 2025-05-07T20:33:06.4078712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.4079154Z 2025-05-07T20:33:06.4079565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.4080064Z 2025-05-07T20:33:06.4080172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.4080650Z self=, 2025-05-07T20:33:06.4081043Z T=128, 2025-05-07T20:33:06.4081229Z D=7168, 2025-05-07T20:33:06.4081465Z scale_ub=None, 2025-05-07T20:33:06.4081679Z contiguous=False, 2025-05-07T20:33:06.4081904Z compiled=False, 2025-05-07T20:33:06.4082103Z ) 2025-05-07T20:33:06.6037300Z self = 2025-05-07T20:33:06.6037991Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.6038407Z 2025-05-07T20:33:06.6038657Z @given( 2025-05-07T20:33:06.6038974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.6044790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.6045134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.6045469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.6045793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.6046084Z ) 2025-05-07T20:33:06.6046433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.6046988Z def test_silu_mul_quant( 2025-05-07T20:33:06.6047250Z self, 2025-05-07T20:33:06.6047467Z T: int, 2025-05-07T20:33:06.6047690Z D: int, 2025-05-07T20:33:06.6047911Z scale_ub: Optional[float], 2025-05-07T20:33:06.6048177Z contiguous: bool, 2025-05-07T20:33:06.6048424Z compiled: bool, 2025-05-07T20:33:06.6048651Z ) -> None: 2025-05-07T20:33:06.6048861Z torch.manual_seed(2025) 2025-05-07T20:33:06.6049119Z 2025-05-07T20:33:06.6049398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.6049737Z 2025-05-07T20:33:06.6049937Z x_sign = torch.sign(x) 2025-05-07T20:33:06.6050233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.6050533Z x = x_sign * x_clamp 2025-05-07T20:33:06.6050777Z x0 = x[:, :D] 2025-05-07T20:33:06.6050993Z x1 = x[:, D:] 2025-05-07T20:33:06.6051197Z 2025-05-07T20:33:06.6051393Z if contiguous: 2025-05-07T20:33:06.6051633Z x0 = x0.contiguous() 2025-05-07T20:33:06.6051886Z x1 = x1.contiguous() 2025-05-07T20:33:06.6052125Z 2025-05-07T20:33:06.6052320Z if scale_ub is not None: 2025-05-07T20:33:06.6052594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.6052920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.6053322Z ) 2025-05-07T20:33:06.6053520Z else: 2025-05-07T20:33:06.6053726Z scale_ub_tensor = None 2025-05-07T20:33:06.6053984Z 2025-05-07T20:33:06.6054213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.6054520Z op = silu_mul_quant 2025-05-07T20:33:06.6054771Z if compiled: 2025-05-07T20:33:06.6055021Z op = torch.compile(op) 2025-05-07T20:33:06.6055307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6055581Z 2025-05-07T20:33:06.6055775Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.6055940Z 2025-05-07T20:33:06.6056053Z moe/activation_test.py:117: 2025-05-07T20:33:06.6056345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6056673Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.6056952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6057634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.6058318Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.6058850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.6059712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.6060361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.6060967Z kernel = self.compile( 2025-05-07T20:33:06.6061569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.6062211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.6062604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6062835Z 2025-05-07T20:33:06.6063045Z self = 2025-05-07T20:33:06.6064168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.6065516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f30860>} 2025-05-07T20:33:06.6066896Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.6067906Z context = 2025-05-07T20:33:06.6068192Z 2025-05-07T20:33:06.6068366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.6068880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.6069348Z module_map=module_map) 2025-05-07T20:33:06.6069717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.6070074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.6070331Z E ^ 2025-05-07T20:33:06.6070794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.6071240Z 2025-05-07T20:33:06.6071663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.6072162Z 2025-05-07T20:33:06.6072272Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.6072676Z self=, 2025-05-07T20:33:06.6073074Z T=4096, 2025-05-07T20:33:06.6073264Z D=5120, 2025-05-07T20:33:06.6073453Z scale_ub=1200.0, 2025-05-07T20:33:06.6073677Z contiguous=True, 2025-05-07T20:33:06.6073901Z compiled=False, 2025-05-07T20:33:06.6074101Z ) 2025-05-07T20:33:06.6074419Z self = 2025-05-07T20:33:06.6074907Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:06.6075171Z 2025-05-07T20:33:06.6075254Z @given( 2025-05-07T20:33:06.6075479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.6075792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.6076097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.6076426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.6076755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.6077039Z ) 2025-05-07T20:33:06.6077383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.6077871Z def test_silu_mul_quant( 2025-05-07T20:33:06.6078110Z self, 2025-05-07T20:33:06.6078306Z T: int, 2025-05-07T20:33:06.6078505Z D: int, 2025-05-07T20:33:06.6078728Z scale_ub: Optional[float], 2025-05-07T20:33:06.6078991Z contiguous: bool, 2025-05-07T20:33:06.6079237Z compiled: bool, 2025-05-07T20:33:06.6079463Z ) -> None: 2025-05-07T20:33:06.6079677Z torch.manual_seed(2025) 2025-05-07T20:33:06.6079926Z 2025-05-07T20:33:06.6080263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.6080602Z 2025-05-07T20:33:06.6080797Z x_sign = torch.sign(x) 2025-05-07T20:33:06.6081138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.6081457Z x = x_sign * x_clamp 2025-05-07T20:33:06.6081695Z x0 = x[:, :D] 2025-05-07T20:33:06.6081917Z x1 = x[:, D:] 2025-05-07T20:33:06.6082125Z 2025-05-07T20:33:06.6082311Z if contiguous: 2025-05-07T20:33:06.6082548Z x0 = x0.contiguous() 2025-05-07T20:33:06.6082816Z x1 = x1.contiguous() 2025-05-07T20:33:06.6083104Z 2025-05-07T20:33:06.6083309Z if scale_ub is not None: 2025-05-07T20:33:06.6083590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.6083922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.6084231Z ) 2025-05-07T20:33:06.6084425Z else: 2025-05-07T20:33:06.6084631Z scale_ub_tensor = None 2025-05-07T20:33:06.6084890Z 2025-05-07T20:33:06.6085123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.6085435Z op = silu_mul_quant 2025-05-07T20:33:06.6085719Z if compiled: 2025-05-07T20:33:06.6085966Z op = torch.compile(op) 2025-05-07T20:33:06.6086267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6086532Z 2025-05-07T20:33:06.6086730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.6086894Z 2025-05-07T20:33:06.6086999Z moe/activation_test.py:117: 2025-05-07T20:33:06.6087289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6087617Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.6087896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6088569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.6089249Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.6089781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.6090456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.6091101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.6091623Z kernel = self.compile( 2025-05-07T20:33:06.6092158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.6092806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.6093243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6093473Z 2025-05-07T20:33:06.6093678Z self = 2025-05-07T20:33:06.6094742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.6096084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f314e0>} 2025-05-07T20:33:06.6097393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.6098399Z context = 2025-05-07T20:33:06.6098685Z 2025-05-07T20:33:06.6098849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.6099365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.6099904Z module_map=module_map) 2025-05-07T20:33:06.6100267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.6100663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.6100921Z E ^ 2025-05-07T20:33:06.6101386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.6101842Z 2025-05-07T20:33:06.6102262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.6102811Z 2025-05-07T20:33:06.6102921Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.6103339Z self=, 2025-05-07T20:33:06.6103740Z T=1, 2025-05-07T20:33:06.6103925Z D=5120, 2025-05-07T20:33:06.6104121Z scale_ub=None, 2025-05-07T20:33:06.6104334Z contiguous=True, 2025-05-07T20:33:06.6104560Z compiled=True, 2025-05-07T20:33:06.6104769Z ) 2025-05-07T20:33:06.9902452Z self = 2025-05-07T20:33:06.9903327Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.9903686Z 2025-05-07T20:33:06.9903797Z @given( 2025-05-07T20:33:06.9904116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9904430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9904739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9905071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9905399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9905685Z ) 2025-05-07T20:33:06.9906037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9906481Z def test_silu_mul_quant( 2025-05-07T20:33:06.9906721Z self, 2025-05-07T20:33:06.9906917Z T: int, 2025-05-07T20:33:06.9907120Z D: int, 2025-05-07T20:33:06.9907340Z scale_ub: Optional[float], 2025-05-07T20:33:06.9907635Z contiguous: bool, 2025-05-07T20:33:06.9907909Z compiled: bool, 2025-05-07T20:33:06.9908135Z ) -> None: 2025-05-07T20:33:06.9908355Z torch.manual_seed(2025) 2025-05-07T20:33:06.9908600Z 2025-05-07T20:33:06.9908868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9909203Z 2025-05-07T20:33:06.9909392Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9909675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9909988Z x = x_sign * x_clamp 2025-05-07T20:33:06.9910229Z x0 = x[:, :D] 2025-05-07T20:33:06.9910439Z x1 = x[:, D:] 2025-05-07T20:33:06.9910642Z 2025-05-07T20:33:06.9910827Z if contiguous: 2025-05-07T20:33:06.9911058Z x0 = x0.contiguous() 2025-05-07T20:33:06.9911308Z x1 = x1.contiguous() 2025-05-07T20:33:06.9911558Z 2025-05-07T20:33:06.9911750Z if scale_ub is not None: 2025-05-07T20:33:06.9912027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9912364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9912674Z ) 2025-05-07T20:33:06.9912868Z else: 2025-05-07T20:33:06.9913082Z scale_ub_tensor = None 2025-05-07T20:33:06.9913333Z 2025-05-07T20:33:06.9913563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9913869Z op = silu_mul_quant 2025-05-07T20:33:06.9914117Z if compiled: 2025-05-07T20:33:06.9914362Z op = torch.compile(op) 2025-05-07T20:33:06.9914653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9914927Z 2025-05-07T20:33:06.9915115Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.9915399Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.9915683Z 2025-05-07T20:33:06.9915914Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9916329Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.9916617Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.9916984Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.9917344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.9917647Z 2025-05-07T20:33:06.9917861Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:06.9918082Z 2025-05-07T20:33:06.9918185Z moe/activation_test.py:126: 2025-05-07T20:33:06.9918480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9918873Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.9919188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.9919964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.9920703Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.9921245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9921952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9922629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.9923335Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.9924052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.9924678Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.9925270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.9925781Z fn() 2025-05-07T20:33:06.9926276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.9926848Z self.fn.run( 2025-05-07T20:33:06.9927318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9927853Z kernel = self.compile( 2025-05-07T20:33:06.9928426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9929066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9929462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9929688Z 2025-05-07T20:33:06.9929891Z self = 2025-05-07T20:33:06.9930958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9932315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f32d40>} 2025-05-07T20:33:06.9933706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9934724Z context = 2025-05-07T20:33:06.9935008Z 2025-05-07T20:33:06.9935176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9935688Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9936157Z module_map=module_map) 2025-05-07T20:33:06.9936523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9936926Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.9937196Z E ^ 2025-05-07T20:33:06.9937706Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9938199Z 2025-05-07T20:33:06.9938617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9939121Z 2025-05-07T20:33:06.9939225Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9939632Z self=, 2025-05-07T20:33:06.9940063Z T=2048, 2025-05-07T20:33:06.9940244Z D=5120, 2025-05-07T20:33:06.9940439Z scale_ub=None, 2025-05-07T20:33:06.9940653Z contiguous=True, 2025-05-07T20:33:06.9940869Z compiled=True, 2025-05-07T20:33:06.9941071Z ) 2025-05-07T20:33:07.3588709Z self = 2025-05-07T20:33:07.3589470Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.3589844Z 2025-05-07T20:33:07.3589962Z @given( 2025-05-07T20:33:07.3590420Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3590832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3591240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3591618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3591941Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3592228Z ) 2025-05-07T20:33:07.3592568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3593009Z def test_silu_mul_quant( 2025-05-07T20:33:07.3593249Z self, 2025-05-07T20:33:07.3593447Z T: int, 2025-05-07T20:33:07.3593644Z D: int, 2025-05-07T20:33:07.3593861Z scale_ub: Optional[float], 2025-05-07T20:33:07.3594126Z contiguous: bool, 2025-05-07T20:33:07.3594360Z compiled: bool, 2025-05-07T20:33:07.3594585Z ) -> None: 2025-05-07T20:33:07.3594797Z torch.manual_seed(2025) 2025-05-07T20:33:07.3595030Z 2025-05-07T20:33:07.3595300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3595643Z 2025-05-07T20:33:07.3595831Z x_sign = torch.sign(x) 2025-05-07T20:33:07.3596119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.3596420Z x = x_sign * x_clamp 2025-05-07T20:33:07.3596652Z x0 = x[:, :D] 2025-05-07T20:33:07.3596872Z x1 = x[:, D:] 2025-05-07T20:33:07.3597086Z 2025-05-07T20:33:07.3597268Z if contiguous: 2025-05-07T20:33:07.3597499Z x0 = x0.contiguous() 2025-05-07T20:33:07.3597756Z x1 = x1.contiguous() 2025-05-07T20:33:07.3597991Z 2025-05-07T20:33:07.3598187Z if scale_ub is not None: 2025-05-07T20:33:07.3598463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.3598795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.3599097Z ) 2025-05-07T20:33:07.3599295Z else: 2025-05-07T20:33:07.3599512Z scale_ub_tensor = None 2025-05-07T20:33:07.3599762Z 2025-05-07T20:33:07.3599995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.3600308Z op = silu_mul_quant 2025-05-07T20:33:07.3600553Z if compiled: 2025-05-07T20:33:07.3600806Z op = torch.compile(op) 2025-05-07T20:33:07.3601105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.3601375Z 2025-05-07T20:33:07.3601566Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.3601850Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.3602135Z 2025-05-07T20:33:07.3602376Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.3602712Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.3602999Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.3603380Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.3603732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.3604107Z 2025-05-07T20:33:07.3604307Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.3604501Z 2025-05-07T20:33:07.3604602Z moe/activation_test.py:126: 2025-05-07T20:33:07.3604899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.3605228Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.3605549Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.3606421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.3607161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.3607695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.3608421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.3609142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.3609855Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.3610565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.3611192Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.3611789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.3612291Z fn() 2025-05-07T20:33:07.3612792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.3613454Z self.fn.run( 2025-05-07T20:33:07.3613919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.3614444Z kernel = self.compile( 2025-05-07T20:33:07.3614986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.3615629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.3616017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.3616248Z 2025-05-07T20:33:07.3616454Z self = 2025-05-07T20:33:07.3617519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.3618918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c1ede40>} 2025-05-07T20:33:07.3620242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.3621247Z context = 2025-05-07T20:33:07.3621537Z 2025-05-07T20:33:07.3621701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.3622212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.3622673Z module_map=module_map) 2025-05-07T20:33:07.3623028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.3623381Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.3623644Z E ^ 2025-05-07T20:33:07.3624100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.3624604Z 2025-05-07T20:33:07.3625051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.3625554Z 2025-05-07T20:33:07.3625655Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3626061Z self=, 2025-05-07T20:33:07.3626449Z T=128, 2025-05-07T20:33:07.3626636Z D=5120, 2025-05-07T20:33:07.3626825Z scale_ub=None, 2025-05-07T20:33:07.3627075Z contiguous=True, 2025-05-07T20:33:07.3627296Z compiled=True, 2025-05-07T20:33:07.3627498Z ) 2025-05-07T20:33:07.7888902Z self = 2025-05-07T20:33:07.7889673Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.7890031Z 2025-05-07T20:33:07.7890146Z @given( 2025-05-07T20:33:07.7890414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7890730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7891163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7891501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7891826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7892114Z ) 2025-05-07T20:33:07.7892460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7892897Z def test_silu_mul_quant( 2025-05-07T20:33:07.7893218Z self, 2025-05-07T20:33:07.7893420Z T: int, 2025-05-07T20:33:07.7893611Z D: int, 2025-05-07T20:33:07.7893840Z scale_ub: Optional[float], 2025-05-07T20:33:07.7899167Z contiguous: bool, 2025-05-07T20:33:07.7899459Z compiled: bool, 2025-05-07T20:33:07.7899691Z ) -> None: 2025-05-07T20:33:07.7899915Z torch.manual_seed(2025) 2025-05-07T20:33:07.7900159Z 2025-05-07T20:33:07.7900454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7900806Z 2025-05-07T20:33:07.7901008Z x_sign = torch.sign(x) 2025-05-07T20:33:07.7901308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.7901622Z x = x_sign * x_clamp 2025-05-07T20:33:07.7901867Z x0 = x[:, :D] 2025-05-07T20:33:07.7902094Z x1 = x[:, D:] 2025-05-07T20:33:07.7902310Z 2025-05-07T20:33:07.7902496Z if contiguous: 2025-05-07T20:33:07.7902737Z x0 = x0.contiguous() 2025-05-07T20:33:07.7903007Z x1 = x1.contiguous() 2025-05-07T20:33:07.7903257Z 2025-05-07T20:33:07.7903461Z if scale_ub is not None: 2025-05-07T20:33:07.7903735Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.7904079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.7904387Z ) 2025-05-07T20:33:07.7904590Z else: 2025-05-07T20:33:07.7904801Z scale_ub_tensor = None 2025-05-07T20:33:07.7905054Z 2025-05-07T20:33:07.7905298Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7905628Z op = silu_mul_quant 2025-05-07T20:33:07.7905876Z if compiled: 2025-05-07T20:33:07.7906129Z op = torch.compile(op) 2025-05-07T20:33:07.7906431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7906708Z 2025-05-07T20:33:07.7906908Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.7907197Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.7907490Z 2025-05-07T20:33:07.7907732Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7908085Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.7908404Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.7908714Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.7909072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.7909503Z 2025-05-07T20:33:07.7909707Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.7909907Z 2025-05-07T20:33:07.7910070Z moe/activation_test.py:126: 2025-05-07T20:33:07.7910366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7910693Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.7911016Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.7911794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.7912599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.7913132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7913803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7914479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.7915236Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.7915948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.7916580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.7917176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.7917683Z fn() 2025-05-07T20:33:07.7918187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.7918810Z self.fn.run( 2025-05-07T20:33:07.7919272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7919798Z kernel = self.compile( 2025-05-07T20:33:07.7920331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7920972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7921359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7921589Z 2025-05-07T20:33:07.7921793Z self = 2025-05-07T20:33:07.7922863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7924213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552f0dd00>} 2025-05-07T20:33:07.7925524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7926531Z context = 2025-05-07T20:33:07.7926822Z 2025-05-07T20:33:07.7926984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7927499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7927961Z module_map=module_map) 2025-05-07T20:33:07.7928349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7928725Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.7928994Z E ^ 2025-05-07T20:33:07.7929447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.7929893Z 
2025-05-07T20:33:07.7930300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.7930856Z 
2025-05-07T20:33:07.7930993Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:07.7931404Z     self=,
2025-05-07T20:33:07.7931800Z     T=4096,
2025-05-07T20:33:07.7931989Z     D=5120,
2025-05-07T20:33:07.7932178Z     scale_ub=None,
2025-05-07T20:33:07.7932386Z     contiguous=True,
2025-05-07T20:33:07.7932609Z     compiled=True,
2025-05-07T20:33:07.7932817Z )
2025-05-07T20:33:08.2234585Z self = 
2025-05-07T20:33:08.2235456Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:08.2235818Z 
2025-05-07T20:33:08.2235930Z @given(
2025-05-07T20:33:08.2236217Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:08.2236532Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:08.2236841Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:08.2237173Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:08.2237585Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:33:08.2237874Z )
2025-05-07T20:33:08.2238218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:08.2238653Z def test_silu_mul_quant(
2025-05-07T20:33:08.2238894Z     self,
2025-05-07T20:33:08.2239084Z     T: int,
2025-05-07T20:33:08.2239280Z     D: int,
2025-05-07T20:33:08.2239500Z     scale_ub: Optional[float],
2025-05-07T20:33:08.2239767Z     contiguous: bool,
2025-05-07T20:33:08.2240007Z     compiled: bool,
2025-05-07T20:33:08.2240232Z ) -> None:
2025-05-07T20:33:08.2240441Z     torch.manual_seed(2025)
2025-05-07T20:33:08.2240682Z 
2025-05-07T20:33:08.2240963Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:08.2241297Z 
2025-05-07T20:33:08.2241498Z     x_sign = torch.sign(x)
2025-05-07T20:33:08.2241788Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:08.2242097Z     x = x_sign * x_clamp
2025-05-07T20:33:08.2242345Z     x0 = x[:, :D]
2025-05-07T20:33:08.2242559Z     x1 = x[:, D:]
2025-05-07T20:33:08.2242762Z 
2025-05-07T20:33:08.2242945Z     if contiguous:
2025-05-07T20:33:08.2243180Z         x0 = x0.contiguous()
2025-05-07T20:33:08.2243448Z         x1 = x1.contiguous()
2025-05-07T20:33:08.2243680Z 
2025-05-07T20:33:08.2243874Z     if scale_ub is not None:
2025-05-07T20:33:08.2244150Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:33:08.2244476Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:08.2244785Z         )
2025-05-07T20:33:08.2244981Z     else:
2025-05-07T20:33:08.2245192Z         scale_ub_tensor = None
2025-05-07T20:33:08.2245445Z 
2025-05-07T20:33:08.2245683Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.2245995Z         op = silu_mul_quant
2025-05-07T20:33:08.2246248Z         if compiled:
2025-05-07T20:33:08.2246494Z             op = torch.compile(op)
2025-05-07T20:33:08.2246795Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.2247070Z 
2025-05-07T20:33:08.2247267Z     y_fp8, y_scale = fn()
2025-05-07T20:33:08.2247552Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:08.2247836Z 
2025-05-07T20:33:08.2248082Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.2248420Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:08.2248705Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:08.2249020Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:08.2249376Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.2249680Z 
2025-05-07T20:33:08.2249879Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:08.2250069Z 
2025-05-07T20:33:08.2250250Z moe/activation_test.py:126: 
2025-05-07T20:33:08.2250547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:08.2250967Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:08.2251296Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.2252076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:08.2252818Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:08.2253463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.2254189Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.2254871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:08.2255579Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.2256305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:08.2256981Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:08.2257584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:08.2258093Z     fn()
2025-05-07T20:33:08.2258599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:08.2259376Z     self.fn.run(
2025-05-07T20:33:08.2259916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.2260452Z     kernel = self.compile(
2025-05-07T20:33:08.2260988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.2261640Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.2262031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:08.2262265Z 
2025-05-07T20:33:08.2262469Z self = 
2025-05-07T20:33:08.2263539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.2264884Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055335b4c0>}
2025-05-07T20:33:08.2266199Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.2267205Z context = 
2025-05-07T20:33:08.2267495Z 
2025-05-07T20:33:08.2267664Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.2268176Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.2268634Z                            module_map=module_map)
2025-05-07T20:33:08.2269002Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.2269354Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.2269623Z E       ^
2025-05-07T20:33:08.2270075Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
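Every failing example in this run collapses to the same root cause: the runner's GPU cannot compile FP8 E4M3 (Triton's fp8e4nv) kernels. Triton only enables fp8e4nv on compute capability 8.9 and newer (Ada/Hopper-class parts); the A10G in a g5.4xlarge is sm_86, where only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A capability guard in the test module would skip these cases up front instead of failing every hypothesis example. A minimal sketch of such a guard follows; the helper name supports_fp8e4nv and the class name are illustrative, not FBGEMM APIs:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) needs an Ada (sm_89) or newer GPU;
        # the A10G on this runner is sm_86 and only exposes fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement: guard the whole test class.
    @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...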
2025-05-07T20:33:08.2270522Z 
2025-05-07T20:33:08.2270932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.2271432Z 
2025-05-07T20:33:08.2271630Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.2272041Z     self=,
2025-05-07T20:33:08.2272489Z     T=16384,
2025-05-07T20:33:08.2272688Z     D=5120,
2025-05-07T20:33:08.2272879Z     scale_ub=None,
2025-05-07T20:33:08.2273085Z     contiguous=True,
2025-05-07T20:33:08.2273308Z     compiled=True,
2025-05-07T20:33:08.2273517Z )
2025-05-07T20:33:08.2531555Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:08.2533088Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:08.2534526Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:08.2535510Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:08.2536649Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:08.3409208Z self = 
2025-05-07T20:33:08.3409971Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source and traceback identical to the T=4096 example above: moe/activation_test.py:126 in ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] ...]
2025-05-07T20:33:08.3443397Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.3443746Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.3444008Z E       ^
2025-05-07T20:33:08.3444462Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.3445013Z 
2025-05-07T20:33:08.3445429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
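The recompile_limit warning above is a separate issue from the FP8 failure, but it explains why later compiled=True examples silently fall back to eager: dynamo guards on input strides, and x0 = x[:, :D] sliced from a [T, 2*D] buffer keeps row stride 2*D (10240), while the .contiguous() copy has row stride D (5120), so alternating contiguous examples force a fresh compile each time until the limit of 8 is exhausted. A small sketch of the layout difference, plus one possible mitigation (raising the limit; the config knob is the one named in the warning):

    import torch

    T, D = 4096, 5120
    x = torch.randn([T, 2 * D], dtype=torch.bfloat16)

    x0_view = x[:, :D]              # a view: row stride is 2*D = 10240
    x0_copy = x0_view.contiguous()  # a fresh buffer: row stride is D = 5120
    print(x0_view.stride(), x0_copy.stride())  # (10240, 1) (5120, 1)

    # torch.compile specializes on strides, so the two layouts each compile
    # their own graph; across many hypothesis examples this exhausts the
    # default limit of 8. One mitigation (a sketch, not what the test does):
    torch._dynamo.config.recompile_limit = 64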
2025-05-07T20:33:08.3445935Z 
2025-05-07T20:33:08.3446039Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.3446450Z     self=,
2025-05-07T20:33:08.3446846Z     T=1,
2025-05-07T20:33:08.3447020Z     D=5120,
2025-05-07T20:33:08.3447214Z     scale_ub=1200.0,
2025-05-07T20:33:08.3447435Z     contiguous=True,
2025-05-07T20:33:08.3447699Z     compiled=True,
2025-05-07T20:33:08.3447899Z )
2025-05-07T20:33:08.4853429Z self = 
2025-05-07T20:33:08.4854169Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[... test source identical to the T=4096 example above; this time the compiled forward path fails first: moe/activation_test.py:117 in fn -> torch/_dynamo/eval_frame.py:678 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.4880949Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4881298Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.4881557Z E       ^
2025-05-07T20:33:08.4882012Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4882449Z 
2025-05-07T20:33:08.4882861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.4883367Z 
2025-05-07T20:33:08.4883471Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4883875Z     self=,
2025-05-07T20:33:08.4884268Z     T=1,
2025-05-07T20:33:08.4884446Z     D=5120,
2025-05-07T20:33:08.4884644Z     scale_ub=None,
2025-05-07T20:33:08.4884856Z     contiguous=False,
2025-05-07T20:33:08.4885075Z     compiled=True,
2025-05-07T20:33:08.4885278Z )
2025-05-07T20:33:08.7021289Z self = 
2025-05-07T20:33:08.7029652Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical; the reference path fails again: moe/activation_test.py:126 in ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] ...]
2025-05-07T20:33:08.7063378Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.7063748Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.7064032Z E       ^
2025-05-07T20:33:08.7064498Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
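Both kernels die in compilation, but the quantity under test is simple: silu(x0) * x1 followed by row-wise FP8 quantization. A minimal pure-PyTorch sketch of that reference computation, assuming the common row-wise scheme (per-row scale = row max / FP8 max, optionally clamped by scale_ub); fbgemm's triton_quantize_fp8_row may differ in details such as eps handling or clamping order:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # silu(x0) * x1
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max.clamp(min=1e-12) / FP8_MAX   # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(2, 4, 5120).unbind(0)
    y_fp8, y_scale = silu_mul_quant_ref(x0, x1)
    y = y_fp8.to(torch.float32) * y_scale[:, None]  # dequantize, as the test does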
2025-05-07T20:33:08.7064946Z 
2025-05-07T20:33:08.7065368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.7065875Z 
2025-05-07T20:33:08.7065986Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.7066399Z     self=,
2025-05-07T20:33:08.7066796Z     T=1,
2025-05-07T20:33:08.7066982Z     D=5120,
2025-05-07T20:33:08.7067178Z     scale_ub=None,
2025-05-07T20:33:08.7067399Z     contiguous=True,
2025-05-07T20:33:08.7067625Z     compiled=False,
2025-05-07T20:33:08.7067830Z )
2025-05-07T20:33:08.8550319Z self = 
2025-05-07T20:33:08.8551082Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[... test source identical; the eager forward path fails: moe/activation_test.py:117 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.8577502Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.8577850Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.8578106Z E       ^
2025-05-07T20:33:08.8578631Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.8579122Z 
2025-05-07T20:33:08.8579530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.8580037Z 
2025-05-07T20:33:08.8580139Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.8580548Z     self=,
2025-05-07T20:33:08.8580936Z     T=128,
2025-05-07T20:33:08.8581124Z     D=5120,
2025-05-07T20:33:08.8581311Z     scale_ub=None,
2025-05-07T20:33:08.8581520Z     contiguous=False,
2025-05-07T20:33:08.8581738Z     compiled=True,
2025-05-07T20:33:08.8581939Z )
2025-05-07T20:33:08.8582250Z self = 
2025-05-07T20:33:08.8582729Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical; compiled forward path: moe/activation_test.py:117 in fn -> torch/_dynamo/eval_frame.py:678 -> activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.8608828Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.8609197Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.8609446Z E       ^
2025-05-07T20:33:08.8609902Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
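Hypothesis keeps drawing fresh parameter combinations, but every one is doomed on this GPU, so the log repeats the identical failure. When triaging a case like this, it can help to pin one concrete failing example so it replays deterministically before any randomly drawn ones. A self-contained sketch (a toy test, not the FBGEMM suite):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        contiguous=st.sampled_from([True, False]),
    )
    @example(T=128, contiguous=False)  # the case under triage replays first
    @settings(max_examples=10, deadline=None)
    def test_replays_pinned_case(T: int, contiguous: bool) -> None:
        assert T >= 1  # stand-in for the real assertions

    test_replays_pinned_case()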
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8610344Z 2025-05-07T20:33:08.8610758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8611305Z 2025-05-07T20:33:08.8611414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8611859Z self=, 2025-05-07T20:33:08.8612250Z T=128, 2025-05-07T20:33:08.8612434Z D=7168, 2025-05-07T20:33:08.8612624Z scale_ub=1200.0, 2025-05-07T20:33:08.8612844Z contiguous=False, 2025-05-07T20:33:08.8613139Z compiled=False, 2025-05-07T20:33:08.8613335Z ) 2025-05-07T20:33:08.9729513Z self = 2025-05-07T20:33:08.9730368Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.9730744Z 2025-05-07T20:33:08.9730856Z @given( 2025-05-07T20:33:08.9731174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.9731600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.9732018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.9732450Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.9732786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.9733142Z ) 2025-05-07T20:33:08.9733582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.9734027Z def test_silu_mul_quant( 2025-05-07T20:33:08.9734279Z self, 2025-05-07T20:33:08.9734482Z T: int, 2025-05-07T20:33:08.9734689Z D: int, 2025-05-07T20:33:08.9734916Z scale_ub: Optional[float], 2025-05-07T20:33:08.9735194Z contiguous: bool, 2025-05-07T20:33:08.9735441Z compiled: bool, 2025-05-07T20:33:08.9735684Z ) -> None: 2025-05-07T20:33:08.9735907Z torch.manual_seed(2025) 2025-05-07T20:33:08.9736156Z 2025-05-07T20:33:08.9736432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.9736793Z 2025-05-07T20:33:08.9737004Z x_sign = torch.sign(x) 2025-05-07T20:33:08.9737302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.9737629Z x = x_sign * x_clamp 2025-05-07T20:33:08.9737898Z x0 = x[:, :D] 2025-05-07T20:33:08.9738137Z x1 = x[:, D:] 2025-05-07T20:33:08.9738355Z 2025-05-07T20:33:08.9738555Z if contiguous: 2025-05-07T20:33:08.9738792Z x0 = x0.contiguous() 2025-05-07T20:33:08.9739061Z x1 = x1.contiguous() 2025-05-07T20:33:08.9739323Z 2025-05-07T20:33:08.9739523Z if scale_ub is not None: 2025-05-07T20:33:08.9739808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.9740158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.9740472Z ) 2025-05-07T20:33:08.9740680Z else: 2025-05-07T20:33:08.9740902Z scale_ub_tensor = None 2025-05-07T20:33:08.9741167Z 2025-05-07T20:33:08.9741406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.9741727Z op = silu_mul_quant 2025-05-07T20:33:08.9741991Z if compiled: 2025-05-07T20:33:08.9742246Z op = torch.compile(op) 2025-05-07T20:33:08.9742550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9742841Z 2025-05-07T20:33:08.9743040Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.9743212Z 2025-05-07T20:33:08.9743317Z moe/activation_test.py:117: 2025-05-07T20:33:08.9743618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9743950Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.9744238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9744925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.9745608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.9746141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.9746820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.9747600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.9748124Z kernel = self.compile( 2025-05-07T20:33:08.9748712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.9749357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.9749749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9750042Z 2025-05-07T20:33:08.9750247Z self = 2025-05-07T20:33:08.9751304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.9752654Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553fb3e20>} 2025-05-07T20:33:08.9754008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.9755013Z context = 2025-05-07T20:33:08.9755295Z 2025-05-07T20:33:08.9755465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.9755978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.9756440Z module_map=module_map) 2025-05-07T20:33:08.9756798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.9757148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.9757411Z E ^ 2025-05-07T20:33:08.9757872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.9758315Z 2025-05-07T20:33:08.9758728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.9759469Z 2025-05-07T20:33:08.9759576Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.9759986Z self=, 2025-05-07T20:33:08.9760379Z T=128, 2025-05-07T20:33:08.9760569Z D=5120, 2025-05-07T20:33:08.9760762Z scale_ub=None, 2025-05-07T20:33:08.9760974Z contiguous=False, 2025-05-07T20:33:08.9761197Z compiled=False, 2025-05-07T20:33:08.9761403Z ) 2025-05-07T20:33:08.9761717Z self = 2025-05-07T20:33:08.9762227Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.9762493Z 2025-05-07T20:33:08.9762570Z @given( 2025-05-07T20:33:08.9762803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.9763115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.9763416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.9763743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.9764066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.9764343Z ) 2025-05-07T20:33:08.9764685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.9765127Z def test_silu_mul_quant( 2025-05-07T20:33:08.9765370Z self, 2025-05-07T20:33:08.9765560Z T: int, 2025-05-07T20:33:08.9765763Z D: int, 2025-05-07T20:33:08.9765984Z scale_ub: Optional[float], 2025-05-07T20:33:08.9766249Z contiguous: bool, 2025-05-07T20:33:08.9766488Z compiled: bool, 2025-05-07T20:33:08.9766709Z ) -> None: 2025-05-07T20:33:08.9767002Z torch.manual_seed(2025) 2025-05-07T20:33:08.9767242Z 2025-05-07T20:33:08.9767571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.9767912Z 2025-05-07T20:33:08.9768117Z x_sign = torch.sign(x) 2025-05-07T20:33:08.9768413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.9768720Z x = x_sign * x_clamp 2025-05-07T20:33:08.9768963Z x0 = x[:, :D] 2025-05-07T20:33:08.9769179Z x1 = x[:, D:] 2025-05-07T20:33:08.9769390Z 2025-05-07T20:33:08.9769650Z if contiguous: 2025-05-07T20:33:08.9769882Z x0 = x0.contiguous() 2025-05-07T20:33:08.9770140Z x1 = x1.contiguous() 2025-05-07T20:33:08.9770382Z 2025-05-07T20:33:08.9770575Z if scale_ub is not None: 2025-05-07T20:33:08.9770856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.9771188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.9777886Z ) 2025-05-07T20:33:08.9778106Z else: 2025-05-07T20:33:08.9778322Z scale_ub_tensor = None 2025-05-07T20:33:08.9778605Z 2025-05-07T20:33:08.9778971Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.9779293Z op = silu_mul_quant 2025-05-07T20:33:08.9779538Z if compiled: 2025-05-07T20:33:08.9779788Z op = torch.compile(op) 2025-05-07T20:33:08.9780087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9780356Z 2025-05-07T20:33:08.9780542Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.9780711Z 2025-05-07T20:33:08.9780813Z moe/activation_test.py:117: 2025-05-07T20:33:08.9781108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9781430Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.9781711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9782387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.9783066Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.9783591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.9784254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.9784898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.9785417Z kernel = self.compile( 2025-05-07T20:33:08.9785949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.9786587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.9786977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9787200Z 2025-05-07T20:33:08.9787406Z self = 2025-05-07T20:33:08.9788477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.9789870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385c400>} 2025-05-07T20:33:08.9791189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.9792190Z context = 2025-05-07T20:33:08.9792474Z 2025-05-07T20:33:08.9792636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.9793195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.9793690Z module_map=module_map) 2025-05-07T20:33:08.9794045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.9794392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.9794645Z E ^ 2025-05-07T20:33:08.9795094Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.9795540Z 2025-05-07T20:33:08.9795948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.9796492Z 2025-05-07T20:33:08.9796595Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.9797006Z self=, 2025-05-07T20:33:08.9797395Z T=128, 2025-05-07T20:33:08.9797581Z D=5120, 2025-05-07T20:33:08.9797776Z scale_ub=1200.0, 2025-05-07T20:33:08.9797994Z contiguous=True, 2025-05-07T20:33:08.9798215Z compiled=False, 2025-05-07T20:33:08.9798417Z ) 2025-05-07T20:33:09.1497477Z self = 2025-05-07T20:33:09.1498878Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.1499292Z 2025-05-07T20:33:09.1499399Z @given( 2025-05-07T20:33:09.1499713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1500115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1500435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1500768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1501099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1501381Z ) 2025-05-07T20:33:09.1501732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1502169Z def test_silu_mul_quant( 2025-05-07T20:33:09.1502412Z self, 2025-05-07T20:33:09.1502610Z T: int, 2025-05-07T20:33:09.1502809Z D: int, 2025-05-07T20:33:09.1503033Z scale_ub: Optional[float], 2025-05-07T20:33:09.1503304Z contiguous: bool, 2025-05-07T20:33:09.1503550Z compiled: bool, 2025-05-07T20:33:09.1503773Z ) -> None: 2025-05-07T20:33:09.1503996Z torch.manual_seed(2025) 2025-05-07T20:33:09.1504246Z 2025-05-07T20:33:09.1504522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1504873Z 2025-05-07T20:33:09.1505076Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1505367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1505678Z x = x_sign * x_clamp 2025-05-07T20:33:09.1505931Z x0 = x[:, :D] 2025-05-07T20:33:09.1506156Z x1 = x[:, D:] 2025-05-07T20:33:09.1506366Z 2025-05-07T20:33:09.1506557Z if contiguous: 2025-05-07T20:33:09.1506793Z x0 = x0.contiguous() 2025-05-07T20:33:09.1507058Z x1 = x1.contiguous() 2025-05-07T20:33:09.1507309Z 2025-05-07T20:33:09.1507510Z if scale_ub is not None: 2025-05-07T20:33:09.1507786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1508125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1508433Z ) 2025-05-07T20:33:09.1508623Z else: 2025-05-07T20:33:09.1508846Z scale_ub_tensor = None 2025-05-07T20:33:09.1509104Z 2025-05-07T20:33:09.1509344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1509664Z op = silu_mul_quant 2025-05-07T20:33:09.1509922Z if compiled: 2025-05-07T20:33:09.1510171Z op = torch.compile(op) 2025-05-07T20:33:09.1510471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1510751Z 2025-05-07T20:33:09.1510950Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1511115Z 2025-05-07T20:33:09.1511218Z moe/activation_test.py:117: 2025-05-07T20:33:09.1511609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1511999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1512280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1512964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1513644Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1514171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1514920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1515575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1516097Z kernel = self.compile( 2025-05-07T20:33:09.1516633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1517291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1517728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1517955Z 2025-05-07T20:33:09.1518167Z self = 2025-05-07T20:33:09.1519278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1520636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385d300>} 2025-05-07T20:33:09.1521951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1522969Z context = 2025-05-07T20:33:09.1523251Z 2025-05-07T20:33:09.1523422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1523930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1524393Z module_map=module_map) 2025-05-07T20:33:09.1524762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1525123Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1525390Z E ^ 2025-05-07T20:33:09.1525855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1526298Z 2025-05-07T20:33:09.1526715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1527222Z 2025-05-07T20:33:09.1527327Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1527749Z self=, 2025-05-07T20:33:09.1528154Z T=1, 2025-05-07T20:33:09.1528342Z D=7168, 2025-05-07T20:33:09.1528542Z scale_ub=1200.0, 2025-05-07T20:33:09.1528769Z contiguous=True, 2025-05-07T20:33:09.1529005Z compiled=True, 2025-05-07T20:33:09.1529247Z ) 2025-05-07T20:33:09.1529566Z self = 2025-05-07T20:33:09.1530048Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:09.1530312Z 2025-05-07T20:33:09.1530390Z @given( 2025-05-07T20:33:09.1530622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1530937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1531241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1531620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1531954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1532299Z ) 2025-05-07T20:33:09.1532650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1533177Z def test_silu_mul_quant( 2025-05-07T20:33:09.1533413Z self, 2025-05-07T20:33:09.1533615Z T: int, 2025-05-07T20:33:09.1533822Z D: int, 2025-05-07T20:33:09.1534040Z scale_ub: Optional[float], 2025-05-07T20:33:09.1534319Z contiguous: bool, 2025-05-07T20:33:09.1534608Z compiled: bool, 2025-05-07T20:33:09.1534828Z ) -> None: 2025-05-07T20:33:09.1535044Z torch.manual_seed(2025) 2025-05-07T20:33:09.1535282Z 2025-05-07T20:33:09.1535554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1535914Z 2025-05-07T20:33:09.1536113Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1536398Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1536704Z x = x_sign * x_clamp 2025-05-07T20:33:09.1536944Z x0 = x[:, :D] 2025-05-07T20:33:09.1537216Z x1 = x[:, D:] 2025-05-07T20:33:09.1537425Z 2025-05-07T20:33:09.1537614Z if contiguous: 2025-05-07T20:33:09.1537850Z x0 = x0.contiguous() 2025-05-07T20:33:09.1538106Z x1 = x1.contiguous() 2025-05-07T20:33:09.1538351Z 2025-05-07T20:33:09.1538545Z if scale_ub is not None: 2025-05-07T20:33:09.1538818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1539151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1539456Z ) 2025-05-07T20:33:09.1539650Z else: 2025-05-07T20:33:09.1539856Z scale_ub_tensor = None 2025-05-07T20:33:09.1540110Z 2025-05-07T20:33:09.1540339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1540650Z op = silu_mul_quant 2025-05-07T20:33:09.1540907Z if compiled: 2025-05-07T20:33:09.1541153Z op = torch.compile(op) 2025-05-07T20:33:09.1541446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1541728Z 2025-05-07T20:33:09.1541924Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1542086Z 2025-05-07T20:33:09.1542187Z moe/activation_test.py:117: 2025-05-07T20:33:09.1542492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1542833Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1543110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1543662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1544218Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1544873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1545541Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1546082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1546758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1547413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1547938Z kernel = self.compile( 2025-05-07T20:33:09.1548484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1549129Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1549520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1549747Z 2025-05-07T20:33:09.1549951Z self = 2025-05-07T20:33:09.1551061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1552442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385eac0>} 2025-05-07T20:33:09.1553759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1554805Z context = 2025-05-07T20:33:09.1555088Z 2025-05-07T20:33:09.1555253Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1555766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1556232Z module_map=module_map) 2025-05-07T20:33:09.1556590Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1556989Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1557257Z E ^ 2025-05-07T20:33:09.1557718Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:09.1559356Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[The log repeats the full test body and an identical Triton traceback for this example; duplicate elided. It fails at `y_fp8, y_scale = fn()` with the same CompilationError from _fbgemm_silu_mul_quant.]
2025-05-07T20:33:09.2890984Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.2891393Z     self=,
2025-05-07T20:33:09.2891793Z     T=1,
2025-05-07T20:33:09.2891979Z     D=7168,
2025-05-07T20:33:09.2892173Z     scale_ub=None,
2025-05-07T20:33:09.2892382Z     contiguous=False,
2025-05-07T20:33:09.2892610Z     compiled=True,
2025-05-07T20:33:09.2892810Z )
2025-05-07T20:33:09.3749999Z self = 
2025-05-07T20:33:09.3750869Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:09.3751227Z 
2025-05-07T20:33:09.3751339Z     @given(
2025-05-07T20:33:09.3751615Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:09.3751921Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:09.3752224Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:09.3752557Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:09.3752884Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:09.3753238Z     )
2025-05-07T20:33:09.3753584Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:09.3754020Z     def test_silu_mul_quant(
2025-05-07T20:33:09.3754255Z         self,
2025-05-07T20:33:09.3754445Z         T: int,
2025-05-07T20:33:09.3754641Z         D: int,
2025-05-07T20:33:09.3754853Z         scale_ub: Optional[float],
2025-05-07T20:33:09.3755124Z         contiguous: bool,
2025-05-07T20:33:09.3755359Z         compiled: bool,
2025-05-07T20:33:09.3755578Z     ) -> None:
2025-05-07T20:33:09.3755796Z         torch.manual_seed(2025)
2025-05-07T20:33:09.3756033Z 
2025-05-07T20:33:09.3756296Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.3756632Z 
2025-05-07T20:33:09.3756826Z         x_sign = torch.sign(x)
2025-05-07T20:33:09.3757113Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3757421Z         x = x_sign * x_clamp
2025-05-07T20:33:09.3757668Z         x0 = x[:, :D]
2025-05-07T20:33:09.3757882Z         x1 = x[:, D:]
2025-05-07T20:33:09.3758086Z 
2025-05-07T20:33:09.3758265Z         if contiguous:
2025-05-07T20:33:09.3758506Z             x0 = x0.contiguous()
2025-05-07T20:33:09.3758799Z             x1 = x1.contiguous()
2025-05-07T20:33:09.3759035Z 
2025-05-07T20:33:09.3759416Z         if scale_ub is not None:
2025-05-07T20:33:09.3759685Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:09.3760015Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:09.3760321Z             )
2025-05-07T20:33:09.3760508Z         else:
2025-05-07T20:33:09.3760720Z             scale_ub_tensor = None
2025-05-07T20:33:09.3760972Z 
2025-05-07T20:33:09.3761198Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3761507Z             op = silu_mul_quant
2025-05-07T20:33:09.3761754Z             if compiled:
2025-05-07T20:33:09.3761995Z                 op = torch.compile(op)
2025-05-07T20:33:09.3762293Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3762565Z 
2025-05-07T20:33:09.3762751Z         y_fp8, y_scale = fn()
2025-05-07T20:33:09.3763035Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:09.3763322Z 
2025-05-07T20:33:09.3763556Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3763883Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:09.3764171Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:09.3764478Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:09.3764826Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:09.3765131Z 
2025-05-07T20:33:09.3765331Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:09.3765521Z 
2025-05-07T20:33:09.3765720Z moe/activation_test.py:126: 
2025-05-07T20:33:09.3766014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3766459Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:09.3766784Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:09.3767555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:09.3768292Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:09.3768838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:09.3769603Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3770280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:09.3770986Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:09.3771707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:09.3772384Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:09.3773043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:09.3773551Z     fn()
2025-05-07T20:33:09.3774060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:09.3774632Z     self.fn.run(
2025-05-07T20:33:09.3775101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3775619Z     kernel = self.compile(
2025-05-07T20:33:09.3781963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3782608Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3783006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3783234Z 
2025-05-07T20:33:09.3783441Z self = 
2025-05-07T20:33:09.3784512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3785865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552298b80>}
2025-05-07T20:33:09.3787189Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:09.3788195Z context = 
2025-05-07T20:33:09.3788478Z 
2025-05-07T20:33:09.3788676Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3789213Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3789679Z                            module_map=module_map)
2025-05-07T20:33:09.3790031Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3790383Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:09.3790647Z E       ^
2025-05-07T20:33:09.3791101Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3791546Z 
2025-05-07T20:33:09.3791957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
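[Note: unlike the other examples, this one gets past the fused kernel and fails in the test's own reference path: `triton_quantize_fp8_row` launches `_kernel_quantize_fp8_row` through the autotuner (`do_bench` above), which hits the same fp8e4nv compilation error. For orientation, a hedged eager-PyTorch sketch of what a rowwise fp8 quantization of this shape computes; the exact numerics (eps handling, scale_ub semantics) of FBGEMM's kernel are assumptions, and torch.float8_e5m2 is chosen only because it is an fp8 dtype this GPU does support.]

```python
import torch


def rowwise_quantize_fp8_sketch(y, scale_ub=None, fp8_dtype=torch.float8_e5m2):
    # Per-row dequantization scale, chosen so that
    #   y ~= y_fp8.float() * y_scale[:, None],
    # matching how the test dequantizes above.
    fp8_max = torch.finfo(fp8_dtype).max
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        # Cap the per-row maximum, as the scale_ub argument suggests (assumption).
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(fp8_dtype)
    return y_fp8, y_scale
```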
[Hypothesis then tried the following examples; for each one the log repeats the full test body and an identical Triton traceback, elided here. Every example fails at `y_fp8, y_scale = fn()` with the same CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
2025-05-07T20:33:09.3792569Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.5352438Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:09.5383662Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.6279174Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.7453525Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.7491167Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:09.7521749Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:09.9271413Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
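[Note: since every example fails identically, a single case is enough to reproduce outside the suite. A hypothetical standalone repro follows, with the import path and call signature taken from the traceback and test body above and the smallest shape Hypothesis tried; on this runner (sm_86) it should raise the same CompilationError, while an sm_89+ GPU (e.g. L4 or H100) should succeed.]

```python
import torch

# Import path as shown in the traceback above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Third argument is the optional scale upper bound; None matches the
# scale_ub=None branch of the test.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
print(y_fp8.dtype, y_scale.shape)
```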
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.9302493Z 2025-05-07T20:33:09.9302906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.9303415Z 2025-05-07T20:33:10.2360090Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2360587Z self=, 2025-05-07T20:33:10.2360997Z T=4096, 2025-05-07T20:33:10.2361198Z D=5120, 2025-05-07T20:33:10.2361397Z scale_ub=1200.0, 2025-05-07T20:33:10.2361639Z contiguous=False, 2025-05-07T20:33:10.2362130Z compiled=False, 2025-05-07T20:33:10.2362350Z ) 2025-05-07T20:33:10.2362668Z self = 2025-05-07T20:33:10.2363242Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2363523Z 2025-05-07T20:33:10.2363600Z @given( 2025-05-07T20:33:10.2363833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2364139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2364442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2364835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2365152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2365442Z ) 2025-05-07T20:33:10.2365791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2366227Z def test_silu_mul_quant( 2025-05-07T20:33:10.2366461Z self, 2025-05-07T20:33:10.2366663Z T: int, 2025-05-07T20:33:10.2366860Z D: int, 2025-05-07T20:33:10.2367075Z scale_ub: Optional[float], 2025-05-07T20:33:10.2367347Z contiguous: bool, 2025-05-07T20:33:10.2367654Z compiled: bool, 2025-05-07T20:33:10.2367882Z ) -> None: 2025-05-07T20:33:10.2368097Z torch.manual_seed(2025) 2025-05-07T20:33:10.2368337Z 2025-05-07T20:33:10.2368603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2368938Z 2025-05-07T20:33:10.2369126Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2369410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2369720Z x = x_sign * x_clamp 2025-05-07T20:33:10.2369961Z x0 = x[:, :D] 2025-05-07T20:33:10.2370172Z x1 = x[:, D:] 2025-05-07T20:33:10.2370383Z 2025-05-07T20:33:10.2370569Z if contiguous: 2025-05-07T20:33:10.2370804Z x0 = x0.contiguous() 2025-05-07T20:33:10.2371059Z x1 = x1.contiguous() 2025-05-07T20:33:10.2371303Z 2025-05-07T20:33:10.2371494Z if scale_ub is not None: 2025-05-07T20:33:10.2371759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2372093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2372398Z ) 2025-05-07T20:33:10.2372590Z else: 2025-05-07T20:33:10.2372805Z scale_ub_tensor = None 2025-05-07T20:33:10.2373147Z 2025-05-07T20:33:10.2373378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2373690Z op = silu_mul_quant 2025-05-07T20:33:10.2373941Z if compiled: 2025-05-07T20:33:10.2374181Z op = torch.compile(op) 2025-05-07T20:33:10.2374477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2374750Z 2025-05-07T20:33:10.2374934Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2375099Z 2025-05-07T20:33:10.2375198Z moe/activation_test.py:117: 2025-05-07T20:33:10.2375493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2375824Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2376133Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2382084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.2382774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2383302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2383977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2384636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2385165Z kernel = self.compile( 2025-05-07T20:33:10.2385698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2386430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2386860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2387088Z 2025-05-07T20:33:10.2387291Z self = 2025-05-07T20:33:10.2388360Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2389801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552392160>} 2025-05-07T20:33:10.2391108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2392116Z context = 2025-05-07T20:33:10.2392398Z 2025-05-07T20:33:10.2392622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2393138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2393600Z module_map=module_map) 2025-05-07T20:33:10.2393974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2394324Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2394584Z E ^ 2025-05-07T20:33:10.2395049Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2395488Z 2025-05-07T20:33:10.2395893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2396394Z 2025-05-07T20:33:10.2396499Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2396901Z self=, 2025-05-07T20:33:10.2397302Z T=4096, 2025-05-07T20:33:10.2397485Z D=5120, 2025-05-07T20:33:10.2397673Z scale_ub=1200.0, 2025-05-07T20:33:10.2397893Z contiguous=False, 2025-05-07T20:33:10.2398108Z compiled=True, 2025-05-07T20:33:10.2398305Z ) 2025-05-07T20:33:10.2398617Z self = 2025-05-07T20:33:10.2399097Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.2399395Z 2025-05-07T20:33:10.2399487Z @given( 2025-05-07T20:33:10.2399721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2400017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2400316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2400638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2400958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2401231Z ) 2025-05-07T20:33:10.2401581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2402010Z def test_silu_mul_quant( 2025-05-07T20:33:10.2402247Z self, 2025-05-07T20:33:10.2402443Z T: int, 2025-05-07T20:33:10.2402642Z D: int, 2025-05-07T20:33:10.2402847Z scale_ub: Optional[float], 2025-05-07T20:33:10.2403114Z contiguous: bool, 2025-05-07T20:33:10.2403357Z compiled: bool, 2025-05-07T20:33:10.2403578Z ) -> None: 2025-05-07T20:33:10.2403791Z torch.manual_seed(2025) 2025-05-07T20:33:10.2404027Z 2025-05-07T20:33:10.2404288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2404627Z 2025-05-07T20:33:10.2404816Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2405108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2405459Z x = x_sign * x_clamp 2025-05-07T20:33:10.2405697Z x0 = x[:, :D] 2025-05-07T20:33:10.2405915Z x1 = x[:, D:] 2025-05-07T20:33:10.2406120Z 2025-05-07T20:33:10.2406347Z if contiguous: 2025-05-07T20:33:10.2406578Z x0 = x0.contiguous() 2025-05-07T20:33:10.2406827Z x1 = x1.contiguous() 2025-05-07T20:33:10.2407065Z 2025-05-07T20:33:10.2407257Z if scale_ub is not None: 2025-05-07T20:33:10.2407521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2407846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2408194Z ) 2025-05-07T20:33:10.2408381Z else: 2025-05-07T20:33:10.2408589Z scale_ub_tensor = None 2025-05-07T20:33:10.2408838Z 2025-05-07T20:33:10.2409061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2409370Z op = silu_mul_quant 2025-05-07T20:33:10.2409616Z if compiled: 2025-05-07T20:33:10.2409856Z op = torch.compile(op) 2025-05-07T20:33:10.2410151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2410421Z 2025-05-07T20:33:10.2410659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2410819Z 2025-05-07T20:33:10.2410918Z moe/activation_test.py:117: 2025-05-07T20:33:10.2411209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2411531Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2411801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2412349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.2412900Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.2413619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2414290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2414809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2415481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2416134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2416654Z kernel = self.compile( 2025-05-07T20:33:10.2417197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2417834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2418220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2418445Z 2025-05-07T20:33:10.2418645Z self = 2025-05-07T20:33:10.2419748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2421096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552393240>} 2025-05-07T20:33:10.2422400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2423397Z context = 2025-05-07T20:33:10.2423683Z 2025-05-07T20:33:10.2423846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2424355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2424807Z module_map=module_map) 2025-05-07T20:33:10.2425216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2425564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2425821Z E ^ 2025-05-07T20:33:10.2426313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2426757Z 2025-05-07T20:33:10.2427164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2427662Z 2025-05-07T20:33:10.3552691Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3553264Z self=, 2025-05-07T20:33:10.3553665Z T=2048, 2025-05-07T20:33:10.3553861Z D=7168, 2025-05-07T20:33:10.3554057Z scale_ub=1200.0, 2025-05-07T20:33:10.3554278Z contiguous=False, 2025-05-07T20:33:10.3554512Z compiled=False, 2025-05-07T20:33:10.3554724Z ) 2025-05-07T20:33:10.3555058Z self = 2025-05-07T20:33:10.3555555Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.3555834Z 2025-05-07T20:33:10.3556005Z @given( 2025-05-07T20:33:10.3556235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.3556547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.3556849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.3557172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.3557494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.3557784Z ) 2025-05-07T20:33:10.3558137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.3558564Z def test_silu_mul_quant( 2025-05-07T20:33:10.3558806Z self, 2025-05-07T20:33:10.3559000Z T: int, 2025-05-07T20:33:10.3559399Z D: int, 2025-05-07T20:33:10.3559616Z scale_ub: Optional[float], 2025-05-07T20:33:10.3559889Z contiguous: bool, 2025-05-07T20:33:10.3560126Z compiled: bool, 2025-05-07T20:33:10.3560342Z ) -> None: 2025-05-07T20:33:10.3560560Z torch.manual_seed(2025) 2025-05-07T20:33:10.3560803Z 2025-05-07T20:33:10.3561073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.3561413Z 2025-05-07T20:33:10.3561611Z x_sign = torch.sign(x) 2025-05-07T20:33:10.3561895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.3562201Z x = x_sign * x_clamp 2025-05-07T20:33:10.3562439Z x0 = x[:, :D] 2025-05-07T20:33:10.3562650Z x1 = x[:, D:] 2025-05-07T20:33:10.3562863Z 2025-05-07T20:33:10.3563048Z if contiguous: 2025-05-07T20:33:10.3563274Z x0 = x0.contiguous() 2025-05-07T20:33:10.3563532Z x1 = x1.contiguous() 2025-05-07T20:33:10.3563776Z 2025-05-07T20:33:10.3563964Z if scale_ub is not None: 2025-05-07T20:33:10.3564245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.3564574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.3564884Z ) 2025-05-07T20:33:10.3565074Z else: 2025-05-07T20:33:10.3565289Z scale_ub_tensor = None 2025-05-07T20:33:10.3565539Z 2025-05-07T20:33:10.3565767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.3566082Z op = silu_mul_quant 2025-05-07T20:33:10.3566331Z if compiled: 2025-05-07T20:33:10.3566573Z op = torch.compile(op) 2025-05-07T20:33:10.3566872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3567143Z 2025-05-07T20:33:10.3567340Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.3567501Z 2025-05-07T20:33:10.3567632Z moe/activation_test.py:117: 2025-05-07T20:33:10.3567926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3568260Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.3568612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3569347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.3570031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.3570555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.3571224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.3571877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.3572462Z kernel = self.compile( 2025-05-07T20:33:10.3573062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.3573706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.3574098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3574327Z 2025-05-07T20:33:10.3574535Z self = 2025-05-07T20:33:10.3575659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.3577007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea4220>} 2025-05-07T20:33:10.3578324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.3579380Z context = 2025-05-07T20:33:10.3579664Z 2025-05-07T20:33:10.3579826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.3580342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.3580802Z module_map=module_map) 2025-05-07T20:33:10.3581165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.3581516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.3581772Z E ^ 2025-05-07T20:33:10.3582226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.3582672Z 2025-05-07T20:33:10.3583085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.3583590Z 2025-05-07T20:33:10.3583694Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3584103Z self=, 2025-05-07T20:33:10.3584511Z T=1, 2025-05-07T20:33:10.3584696Z D=7168, 2025-05-07T20:33:10.3584898Z scale_ub=None, 2025-05-07T20:33:10.3585110Z contiguous=True, 2025-05-07T20:33:10.3585335Z compiled=False, 2025-05-07T20:33:10.3585542Z ) 2025-05-07T20:33:10.3585860Z self = 2025-05-07T20:33:10.3586335Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.3586596Z 2025-05-07T20:33:10.3586674Z @given( 2025-05-07T20:33:10.3586903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.3587212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.3587512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.3587843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.3588166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.3588449Z ) 2025-05-07T20:33:10.3588844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.3589314Z def test_silu_mul_quant( 2025-05-07T20:33:10.3589609Z self, 2025-05-07T20:33:10.3589812Z T: int, 2025-05-07T20:33:10.3590012Z D: int, 2025-05-07T20:33:10.3590226Z scale_ub: Optional[float], 2025-05-07T20:33:10.3590496Z contiguous: bool, 2025-05-07T20:33:10.3590739Z compiled: bool, 2025-05-07T20:33:10.3590951Z ) -> None: 2025-05-07T20:33:10.3591167Z torch.manual_seed(2025) 2025-05-07T20:33:10.3591407Z 2025-05-07T20:33:10.3591717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.3592050Z 2025-05-07T20:33:10.3592244Z x_sign = torch.sign(x) 2025-05-07T20:33:10.3592527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.3592833Z x = x_sign * x_clamp 2025-05-07T20:33:10.3593075Z x0 = x[:, :D] 2025-05-07T20:33:10.3593285Z x1 = x[:, D:] 2025-05-07T20:33:10.3593495Z 2025-05-07T20:33:10.3593680Z if contiguous: 2025-05-07T20:33:10.3593905Z x0 = x0.contiguous() 2025-05-07T20:33:10.3594207Z x1 = x1.contiguous() 2025-05-07T20:33:10.3594449Z 2025-05-07T20:33:10.3594637Z if scale_ub is not None: 2025-05-07T20:33:10.3594918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.3595246Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.3595549Z ) 2025-05-07T20:33:10.3595737Z else: 2025-05-07T20:33:10.3595949Z scale_ub_tensor = None 2025-05-07T20:33:10.3596201Z 2025-05-07T20:33:10.3596427Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.3596738Z op = silu_mul_quant 2025-05-07T20:33:10.3596987Z if compiled: 2025-05-07T20:33:10.3597232Z op = torch.compile(op) 2025-05-07T20:33:10.3597530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3597805Z 2025-05-07T20:33:10.3597992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.3598159Z 2025-05-07T20:33:10.3598254Z moe/activation_test.py:117: 2025-05-07T20:33:10.3598553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3598882Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.3599177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3599888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.3600563Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.3601087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.3601757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.3602410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.3602935Z kernel = self.compile( 2025-05-07T20:33:10.3603472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.3604123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.3604512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3604735Z 2025-05-07T20:33:10.3604943Z self = 2025-05-07T20:33:10.3605998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.3607352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea5120>} 2025-05-07T20:33:10.3608758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.3609822Z context = 2025-05-07T20:33:10.3610104Z 2025-05-07T20:33:10.3610271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.3610790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.3611294Z module_map=module_map) 2025-05-07T20:33:10.3611659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.3612007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.3612263Z E ^ 2025-05-07T20:33:10.3612720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.3613209Z 2025-05-07T20:33:10.3613618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.3614127Z 2025-05-07T20:33:10.3614276Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3614689Z self=, 2025-05-07T20:33:10.3615086Z T=16384, 2025-05-07T20:33:10.3615273Z D=7168, 2025-05-07T20:33:10.3615469Z scale_ub=1200.0, 2025-05-07T20:33:10.3615689Z contiguous=False, 2025-05-07T20:33:10.3615909Z compiled=True, 2025-05-07T20:33:10.5996778Z ) 2025-05-07T20:33:10.5997123Z self = 2025-05-07T20:33:10.5997663Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.5997957Z 2025-05-07T20:33:10.5998040Z @given( 2025-05-07T20:33:10.5998269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.5998588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.5998960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.5999633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6000285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6000853Z ) 2025-05-07T20:33:10.6001547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6002431Z def test_silu_mul_quant( 2025-05-07T20:33:10.6002922Z self, 2025-05-07T20:33:10.6003318Z T: int, 2025-05-07T20:33:10.6003712Z D: int, 2025-05-07T20:33:10.6004159Z scale_ub: Optional[float], 2025-05-07T20:33:10.6004696Z contiguous: bool, 2025-05-07T20:33:10.6005173Z compiled: bool, 2025-05-07T20:33:10.6005630Z ) -> None: 2025-05-07T20:33:10.6006065Z torch.manual_seed(2025) 2025-05-07T20:33:10.6006545Z 2025-05-07T20:33:10.6007092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6007780Z 2025-05-07T20:33:10.6008165Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6008750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6009254Z x = x_sign * x_clamp 2025-05-07T20:33:10.6009495Z x0 = x[:, :D] 2025-05-07T20:33:10.6009715Z x1 = x[:, D:] 2025-05-07T20:33:10.6009930Z 2025-05-07T20:33:10.6010116Z if contiguous: 2025-05-07T20:33:10.6010353Z x0 = x0.contiguous() 2025-05-07T20:33:10.6010617Z x1 = x1.contiguous() 2025-05-07T20:33:10.6010855Z 2025-05-07T20:33:10.6011058Z if scale_ub is not None: 2025-05-07T20:33:10.6011339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6011676Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6011983Z ) 2025-05-07T20:33:10.6012187Z else: 2025-05-07T20:33:10.6012404Z scale_ub_tensor = None 2025-05-07T20:33:10.6012660Z 2025-05-07T20:33:10.6013093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6013416Z op = silu_mul_quant 2025-05-07T20:33:10.6013732Z if compiled: 2025-05-07T20:33:10.6013990Z op = torch.compile(op) 2025-05-07T20:33:10.6014292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6014561Z 2025-05-07T20:33:10.6014756Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6014922Z 2025-05-07T20:33:10.6015051Z moe/activation_test.py:117: 2025-05-07T20:33:10.6015351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6015774Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6016061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6016618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6017177Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6017825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6018508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6019114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6019838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6020492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6021023Z kernel = self.compile( 2025-05-07T20:33:10.6021563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6022207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6022600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6022830Z 2025-05-07T20:33:10.6023037Z self = 2025-05-07T20:33:10.6024109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6025457Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea6520>} 2025-05-07T20:33:10.6026779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6027788Z context = 2025-05-07T20:33:10.6028071Z 2025-05-07T20:33:10.6028238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6028750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6029236Z module_map=module_map) 2025-05-07T20:33:10.6029622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6029973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6030226Z E ^ 2025-05-07T20:33:10.6030685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6031126Z 2025-05-07T20:33:10.6031542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6032045Z 2025-05-07T20:33:10.6032153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6032561Z self=, 2025-05-07T20:33:10.6032960Z T=1, 2025-05-07T20:33:10.6033144Z D=7168, 2025-05-07T20:33:10.6033384Z scale_ub=None, 2025-05-07T20:33:10.6033596Z contiguous=False, 2025-05-07T20:33:10.6033827Z compiled=False, 2025-05-07T20:33:10.6034028Z ) 2025-05-07T20:33:10.6034390Z self = 2025-05-07T20:33:10.6034875Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.6035132Z 2025-05-07T20:33:10.6035211Z @given( 2025-05-07T20:33:10.6035447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6035754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6036100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6036421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6036746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6037026Z ) 2025-05-07T20:33:10.6037367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6037804Z def test_silu_mul_quant( 2025-05-07T20:33:10.6043227Z self, 2025-05-07T20:33:10.6043442Z T: int, 2025-05-07T20:33:10.6043645Z D: int, 2025-05-07T20:33:10.6043942Z scale_ub: Optional[float], 2025-05-07T20:33:10.6044218Z contiguous: bool, 2025-05-07T20:33:10.6044456Z compiled: bool, 2025-05-07T20:33:10.6044685Z ) -> None: 2025-05-07T20:33:10.6044902Z torch.manual_seed(2025) 2025-05-07T20:33:10.6045138Z 2025-05-07T20:33:10.6045406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6045750Z 2025-05-07T20:33:10.6045943Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6046228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6046527Z x = x_sign * x_clamp 2025-05-07T20:33:10.6046756Z x0 = x[:, :D] 2025-05-07T20:33:10.6046971Z x1 = x[:, D:] 2025-05-07T20:33:10.6047178Z 2025-05-07T20:33:10.6047361Z if contiguous: 2025-05-07T20:33:10.6047591Z x0 = x0.contiguous() 2025-05-07T20:33:10.6047843Z x1 = x1.contiguous() 2025-05-07T20:33:10.6048076Z 2025-05-07T20:33:10.6048270Z if scale_ub is not None: 2025-05-07T20:33:10.6048543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6048874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6049182Z ) 2025-05-07T20:33:10.6049372Z else: 2025-05-07T20:33:10.6049586Z scale_ub_tensor = None 2025-05-07T20:33:10.6049828Z 2025-05-07T20:33:10.6050061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6050374Z op = silu_mul_quant 2025-05-07T20:33:10.6050614Z if compiled: 2025-05-07T20:33:10.6050859Z op = torch.compile(op) 2025-05-07T20:33:10.6051154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6051420Z 2025-05-07T20:33:10.6051608Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6051766Z 2025-05-07T20:33:10.6051867Z moe/activation_test.py:117: 2025-05-07T20:33:10.6052156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6052484Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6052760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6053490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6054164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6054689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6055361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6056008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6056534Z kernel = self.compile( 2025-05-07T20:33:10.6057063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6057754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6058181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6058413Z 2025-05-07T20:33:10.6058616Z self = 2025-05-07T20:33:10.6059981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6061477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea7100>} 2025-05-07T20:33:10.6062781Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6063853Z context = 2025-05-07T20:33:10.6064142Z 2025-05-07T20:33:10.6064305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6064816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6065274Z module_map=module_map) 2025-05-07T20:33:10.6065636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6065986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6066245Z E ^ 2025-05-07T20:33:10.6066699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6067143Z 2025-05-07T20:33:10.6067552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6068054Z 2025-05-07T20:33:10.6068167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6068581Z self=, 2025-05-07T20:33:10.6068974Z T=2048, 2025-05-07T20:33:10.6069166Z D=7168, 2025-05-07T20:33:10.6069355Z scale_ub=None, 2025-05-07T20:33:10.6069567Z contiguous=False, 2025-05-07T20:33:10.6069790Z compiled=True, 2025-05-07T20:33:10.6069993Z ) 2025-05-07T20:33:10.6924018Z self = 2025-05-07T20:33:10.6925056Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.6925598Z 2025-05-07T20:33:10.6925751Z @given( 2025-05-07T20:33:10.6926210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6926817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6927420Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6928071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6928718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6929155Z ) 2025-05-07T20:33:10.6929511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6929952Z def test_silu_mul_quant( 2025-05-07T20:33:10.6930192Z self, 2025-05-07T20:33:10.6930395Z T: int, 2025-05-07T20:33:10.6930596Z D: int, 2025-05-07T20:33:10.6930806Z scale_ub: Optional[float], 2025-05-07T20:33:10.6931073Z contiguous: bool, 2025-05-07T20:33:10.6931323Z compiled: bool, 2025-05-07T20:33:10.6931539Z ) -> None: 2025-05-07T20:33:10.6931755Z torch.manual_seed(2025) 2025-05-07T20:33:10.6931999Z 2025-05-07T20:33:10.6932267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6932603Z 2025-05-07T20:33:10.6932794Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6933165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6933581Z x = x_sign * x_clamp 2025-05-07T20:33:10.6933825Z x0 = x[:, :D] 2025-05-07T20:33:10.6934107Z x1 = x[:, D:] 2025-05-07T20:33:10.6934318Z 2025-05-07T20:33:10.6934508Z if contiguous: 2025-05-07T20:33:10.6934744Z x0 = x0.contiguous() 2025-05-07T20:33:10.6935004Z x1 = x1.contiguous() 2025-05-07T20:33:10.6935253Z 2025-05-07T20:33:10.6935450Z if scale_ub is not None: 2025-05-07T20:33:10.6935723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6936129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6936444Z ) 2025-05-07T20:33:10.6936643Z else: 2025-05-07T20:33:10.6936855Z scale_ub_tensor = None 2025-05-07T20:33:10.6937119Z 2025-05-07T20:33:10.6937352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6937668Z op = silu_mul_quant 2025-05-07T20:33:10.6937925Z if compiled: 2025-05-07T20:33:10.6938177Z op = torch.compile(op) 2025-05-07T20:33:10.6938482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6938831Z 2025-05-07T20:33:10.6939031Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6939212Z 2025-05-07T20:33:10.6939319Z moe/activation_test.py:117: 2025-05-07T20:33:10.6939639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6939966Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6940246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6940802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6941361Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6942008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6942687Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6943220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6943893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6944552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6945078Z kernel = self.compile( 2025-05-07T20:33:10.6945613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6946260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6946650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6946881Z 2025-05-07T20:33:10.6947087Z self = 2025-05-07T20:33:10.6948152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6949548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edc720>} 2025-05-07T20:33:10.6950863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6951878Z context = 2025-05-07T20:33:10.6952170Z 2025-05-07T20:33:10.6952337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6952854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6953377Z module_map=module_map) 2025-05-07T20:33:10.6953742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6954160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6954422Z E ^ 2025-05-07T20:33:10.6954884Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6955339Z 2025-05-07T20:33:10.6955752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6956258Z 2025-05-07T20:33:10.6956410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6956821Z self=, 2025-05-07T20:33:10.6957228Z T=4096, 2025-05-07T20:33:10.6957426Z D=7168, 2025-05-07T20:33:10.6957620Z scale_ub=None, 2025-05-07T20:33:10.6957831Z contiguous=False, 2025-05-07T20:33:10.6958059Z compiled=True, 2025-05-07T20:33:10.6958262Z ) 2025-05-07T20:33:10.6958584Z self = 2025-05-07T20:33:10.6959123Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.6959561Z 2025-05-07T20:33:10.6959646Z @given( 2025-05-07T20:33:10.6959875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6960194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6960496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6960827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6961159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6961446Z ) 2025-05-07T20:33:10.6961791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6962236Z def test_silu_mul_quant( 2025-05-07T20:33:10.6962484Z self, 2025-05-07T20:33:10.6962677Z T: int, 2025-05-07T20:33:10.6962875Z D: int, 2025-05-07T20:33:10.6963099Z scale_ub: Optional[float], 2025-05-07T20:33:10.6963368Z contiguous: bool, 2025-05-07T20:33:10.6963615Z compiled: bool, 2025-05-07T20:33:10.6963851Z ) -> None: 2025-05-07T20:33:10.6964060Z torch.manual_seed(2025) 2025-05-07T20:33:10.6964306Z 2025-05-07T20:33:10.6964584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6964920Z 2025-05-07T20:33:10.6965115Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6965404Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6965715Z x = x_sign * x_clamp 2025-05-07T20:33:10.6965958Z x0 = x[:, :D] 2025-05-07T20:33:10.6966194Z x1 = x[:, D:] 2025-05-07T20:33:10.6966404Z 2025-05-07T20:33:10.6966590Z if contiguous: 2025-05-07T20:33:10.6966826Z x0 = x0.contiguous() 2025-05-07T20:33:10.6967087Z x1 = x1.contiguous() 2025-05-07T20:33:10.6967324Z 2025-05-07T20:33:10.6967521Z if scale_ub is not None: 2025-05-07T20:33:10.6967798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6968138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6968449Z ) 2025-05-07T20:33:10.6968652Z else: 2025-05-07T20:33:10.6968866Z scale_ub_tensor = None 2025-05-07T20:33:10.6969125Z 2025-05-07T20:33:10.6969396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6969726Z op = silu_mul_quant 2025-05-07T20:33:10.6969982Z if compiled: 2025-05-07T20:33:10.6970245Z op = torch.compile(op) 2025-05-07T20:33:10.6970547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6970817Z 2025-05-07T20:33:10.6971013Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6971177Z 2025-05-07T20:33:10.6971279Z moe/activation_test.py:117: 2025-05-07T20:33:10.6971570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6971974Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6972266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6972872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6973471Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6974126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6974813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6975344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6976080Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6976736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6977259Z kernel = self.compile( 2025-05-07T20:33:10.6977799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6978510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6978907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6979138Z 2025-05-07T20:33:10.6979363Z self = 2025-05-07T20:33:10.6980449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6981794Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edd440>} 2025-05-07T20:33:10.6983113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6984129Z context = 2025-05-07T20:33:10.6984412Z 2025-05-07T20:33:10.6984579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6985101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6985566Z module_map=module_map) 2025-05-07T20:33:10.6985929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6986286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6986549Z E ^ 2025-05-07T20:33:10.6987010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6987452Z 2025-05-07T20:33:10.6987863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6988370Z 2025-05-07T20:33:10.8552132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.8552982Z self=, 2025-05-07T20:33:10.8553776Z T=16384, 2025-05-07T20:33:10.8554211Z D=5120, 2025-05-07T20:33:10.8554594Z scale_ub=1200.0, 2025-05-07T20:33:10.8555042Z contiguous=False, 2025-05-07T20:33:10.8555482Z compiled=False, 2025-05-07T20:33:10.8555885Z ) 2025-05-07T20:33:10.8556508Z self = 2025-05-07T20:33:10.8557493Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.8558044Z 2025-05-07T20:33:10.8558198Z @given( 2025-05-07T20:33:10.8558639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.8559182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.8559794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.8560121Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.8560522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.8560810Z ) 2025-05-07T20:33:10.8561160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.8561596Z def test_silu_mul_quant( 2025-05-07T20:33:10.8561842Z self, 2025-05-07T20:33:10.8562039Z T: int, 2025-05-07T20:33:10.8562228Z D: int, 2025-05-07T20:33:10.8562449Z scale_ub: Optional[float], 2025-05-07T20:33:10.8562784Z contiguous: bool, 2025-05-07T20:33:10.8563020Z compiled: bool, 2025-05-07T20:33:10.8563247Z ) -> None: 2025-05-07T20:33:10.8563465Z torch.manual_seed(2025) 2025-05-07T20:33:10.8563701Z 2025-05-07T20:33:10.8563971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.8564310Z 2025-05-07T20:33:10.8564509Z x_sign = torch.sign(x) 2025-05-07T20:33:10.8564803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.8565116Z x = x_sign * x_clamp 2025-05-07T20:33:10.8565415Z x0 = x[:, :D] 2025-05-07T20:33:10.8565640Z x1 = x[:, D:] 2025-05-07T20:33:10.8565851Z 2025-05-07T20:33:10.8566040Z if contiguous: 2025-05-07T20:33:10.8566276Z x0 = x0.contiguous() 2025-05-07T20:33:10.8566535Z x1 = x1.contiguous() 2025-05-07T20:33:10.8566788Z 2025-05-07T20:33:10.8566975Z if scale_ub is not None: 2025-05-07T20:33:10.8567254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.8567600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.8567936Z ) 2025-05-07T20:33:10.8568135Z else: 2025-05-07T20:33:10.8568357Z scale_ub_tensor = None 2025-05-07T20:33:10.8568622Z 2025-05-07T20:33:10.8568864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.8569214Z op = silu_mul_quant 2025-05-07T20:33:10.8569527Z if compiled: 2025-05-07T20:33:10.8569792Z op = torch.compile(op) 2025-05-07T20:33:10.8570118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8570415Z 2025-05-07T20:33:10.8570613Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.8570794Z 2025-05-07T20:33:10.8570899Z moe/activation_test.py:117: 2025-05-07T20:33:10.8571227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8571593Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.8571910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8572720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.8573514Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.8574045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.8574730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.8575394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.8575915Z kernel = self.compile( 2025-05-07T20:33:10.8576452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.8577099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.8577497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8577731Z 2025-05-07T20:33:10.8577936Z self = 2025-05-07T20:33:10.8578997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.8580535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ede340>} 2025-05-07T20:33:10.8581856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.8582861Z context = 2025-05-07T20:33:10.8583191Z 2025-05-07T20:33:10.8583359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.8583876Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.8584339Z module_map=module_map) 2025-05-07T20:33:10.8584702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.8585062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.8585327Z E ^ 2025-05-07T20:33:10.8585826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.8586276Z 2025-05-07T20:33:10.8586690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.8587196Z 2025-05-07T20:33:10.8587303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.8587719Z self=, 2025-05-07T20:33:10.8588117Z T=16384, 2025-05-07T20:33:10.8588315Z D=5120, 2025-05-07T20:33:10.8588509Z scale_ub=1200.0, 2025-05-07T20:33:10.8588734Z contiguous=True, 2025-05-07T20:33:10.8588958Z compiled=True, 2025-05-07T20:33:10.8589168Z ) 2025-05-07T20:33:10.8589484Z self = 2025-05-07T20:33:10.8589980Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.8590254Z 2025-05-07T20:33:10.8590342Z @given( 2025-05-07T20:33:10.8590585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.8590896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.8591207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.8591536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.8591862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.8592149Z ) 2025-05-07T20:33:10.8592505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.8592940Z def test_silu_mul_quant( 2025-05-07T20:33:10.8593186Z self, 2025-05-07T20:33:10.8593386Z T: int, 2025-05-07T20:33:10.8593586Z D: int, 2025-05-07T20:33:10.8593804Z scale_ub: Optional[float], 2025-05-07T20:33:10.8594074Z contiguous: bool, 2025-05-07T20:33:10.8594315Z compiled: bool, 2025-05-07T20:33:10.8594540Z ) -> None: 2025-05-07T20:33:10.8594760Z torch.manual_seed(2025) 2025-05-07T20:33:10.8595007Z 2025-05-07T20:33:10.8595280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.8595616Z 2025-05-07T20:33:10.8595811Z x_sign = torch.sign(x) 2025-05-07T20:33:10.8596096Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.8596406Z x = x_sign * x_clamp 2025-05-07T20:33:10.8596652Z x0 = x[:, :D] 2025-05-07T20:33:10.8596868Z x1 = x[:, D:] 2025-05-07T20:33:10.8597080Z 2025-05-07T20:33:10.8597274Z if contiguous: 2025-05-07T20:33:10.8597502Z x0 = x0.contiguous() 2025-05-07T20:33:10.8597765Z x1 = x1.contiguous() 2025-05-07T20:33:10.8598011Z 2025-05-07T20:33:10.8598201Z if scale_ub is not None: 2025-05-07T20:33:10.8598476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.8598880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.8599183Z ) 2025-05-07T20:33:10.8599404Z else: 2025-05-07T20:33:10.8599684Z scale_ub_tensor = None 2025-05-07T20:33:10.8599931Z 2025-05-07T20:33:10.8600160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.8600476Z op = silu_mul_quant 2025-05-07T20:33:10.8600723Z if compiled: 2025-05-07T20:33:10.8600971Z op = torch.compile(op) 2025-05-07T20:33:10.8601267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8601583Z 2025-05-07T20:33:10.8601775Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.8601943Z 2025-05-07T20:33:10.8602044Z moe/activation_test.py:117: 2025-05-07T20:33:10.8602340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8602669Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.8602953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8603509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.8604104Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.8604752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.8605431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.8605959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.8606627Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.8607282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.8607803Z kernel = self.compile( 2025-05-07T20:33:10.8608335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.8608974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.8609372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8609613Z 2025-05-07T20:33:10.8609858Z self = 2025-05-07T20:33:10.8616145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.8617510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edf9c0>} 2025-05-07T20:33:10.8618829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.8619835Z context = 2025-05-07T20:33:10.8620120Z 2025-05-07T20:33:10.8620294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.8620807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.8621270Z module_map=module_map) 2025-05-07T20:33:10.8621627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.8621976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.8622234Z E ^ 2025-05-07T20:33:10.8622689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.8623135Z 2025-05-07T20:33:10.8623547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.8624121Z 2025-05-07T20:33:11.0304254Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0304845Z self=, 2025-05-07T20:33:11.0305255Z T=16384, 2025-05-07T20:33:11.0305440Z D=5120, 2025-05-07T20:33:11.0305632Z scale_ub=None, 2025-05-07T20:33:11.0305846Z contiguous=False, 2025-05-07T20:33:11.0306071Z compiled=True, 2025-05-07T20:33:11.0306272Z ) 2025-05-07T20:33:11.0306581Z self = 2025-05-07T20:33:11.0307075Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.0307434Z 2025-05-07T20:33:11.0307516Z @given( 2025-05-07T20:33:11.0307746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0308050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0308346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0308674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0309005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0309277Z ) 2025-05-07T20:33:11.0309694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0310131Z def test_silu_mul_quant( 2025-05-07T20:33:11.0310366Z self, 2025-05-07T20:33:11.0310559Z T: int, 2025-05-07T20:33:11.0310754Z D: int, 2025-05-07T20:33:11.0310968Z scale_ub: Optional[float], 2025-05-07T20:33:11.0311230Z contiguous: bool, 2025-05-07T20:33:11.0311462Z compiled: bool, 2025-05-07T20:33:11.0311681Z ) -> None: 2025-05-07T20:33:11.0311893Z torch.manual_seed(2025) 2025-05-07T20:33:11.0312128Z 2025-05-07T20:33:11.0312403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0312739Z 2025-05-07T20:33:11.0312933Z x_sign = torch.sign(x) 2025-05-07T20:33:11.0313223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.0313529Z x = x_sign * x_clamp 2025-05-07T20:33:11.0313770Z x0 = x[:, :D] 2025-05-07T20:33:11.0313982Z x1 = x[:, D:] 2025-05-07T20:33:11.0314185Z 2025-05-07T20:33:11.0314373Z if contiguous: 2025-05-07T20:33:11.0314600Z x0 = x0.contiguous() 2025-05-07T20:33:11.0314854Z x1 = x1.contiguous() 2025-05-07T20:33:11.0315094Z 2025-05-07T20:33:11.0315288Z if scale_ub is not None: 2025-05-07T20:33:11.0315556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.0315887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.0316195Z ) 2025-05-07T20:33:11.0316387Z else: 2025-05-07T20:33:11.0316594Z scale_ub_tensor = None 2025-05-07T20:33:11.0316844Z 2025-05-07T20:33:11.0317077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.0317383Z op = silu_mul_quant 2025-05-07T20:33:11.0317629Z if compiled: 2025-05-07T20:33:11.0317880Z op = torch.compile(op) 2025-05-07T20:33:11.0318164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.0318437Z 2025-05-07T20:33:11.0318631Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.0318792Z 2025-05-07T20:33:11.0318893Z moe/activation_test.py:117: 2025-05-07T20:33:11.0319186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.0319563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.0319846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.0320393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.0320948Z return fn(*args, **kwargs) 
The next ten generated examples fail with this identical traceback; only the drawn parameters differ:

2025-05-07T20:33:11.0335975Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError (type fp8e4nv not supported in this architecture)
2025-05-07T20:33:11.1265079Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.3008760Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:33:11.3040522Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.5749131Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:33:11.6987759Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:33:11.7019341Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:11.8765340Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.8805285Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.9729002Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
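Each "Trying example:" entry above is one parameter combination drawn by Hypothesis from the @given strategies shown earlier. To replay one combination deterministically while debugging, rather than waiting for the sampler to draw it again, the failing arguments can be pinned with Hypothesis's @example decorator. A minimal sketch mirroring the strategies above; the standalone function form and elided body are simplifications, not the test file's actual code:

    from typing import Optional

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the first failing combination from the log so it always runs.
    @example(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the excerpt above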
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
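Every CompilationError in this run has the same root cause: the Triton kernel behind silu_mul_quant emits fp8e4nv (float8 E4M3) values, and the error text says that dtype is unavailable on this architecture, with only fp8e4b15 and fp8e5 supported. The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), below the (8, 9) Ada / (9, 0) Hopper floor that fp8e4nv kernels generally require in this Triton build. A minimal capability gate, as a sketch only (supports_fp8e4nv and Fp8ActivationTests are illustrative names, not FBGEMM's actual gating):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for float8 E4M3; the build in this job
        # accepts it only on GPUs with compute capability (8, 9) or newer.
        # The A10G behind linux.g5.4xlarge reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this, the examples below would be reported as skips instead of repeated hard failures.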
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

[remaining Hypothesis examples condensed; the test source listing and the Triton traceback are identical to those shown above]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 112.00 MiB; 28.44 MiB free, 21.61 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 140.44 MiB free, 21.50 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 56.00 MiB; 28.44 MiB free, 21.67 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB; 28.44 MiB free, 21.67 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
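The full OutOfMemoryError message quoted above ends with the allocator's own hint: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when a large amount of memory is reserved but unallocated. The variable is read when the CUDA caching allocator initializes, so it has to be in the environment before the process first touches CUDA; a sketch of the in-process form (in CI it would normally be exported at the job level instead):

    import os

    # Must be set before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the variable on purpose)

    x = torch.randn(16384, 2 * 5120, device="cuda", dtype=torch.bfloat16)

This only mitigates fragmentation, though; it does nothing about the 21.5+ GiB that remains allocated from one example to the next.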
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB; 26.44 MiB free, 21.69 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch
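One trend in these entries is worth calling out: the amount already allocated by PyTorch climbs from 21.50 GiB to 21.73 GiB and never drops, so even 40.00 MiB requests fail. Hypothesis drives every example through a single invocation of the test method, so unittest setUp/tearDown does not run between examples, and the failure tracebacks it retains can keep frame locals (including the large CUDA tensors) alive. A hedged sketch of an explicit cleanup helper (release_cuda_memory is an illustrative name, not part of this test suite) that the test body could call before allocating:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Collect first so tensors pinned only by dead frames and stored
        # tracebacks become garbage, then return cached blocks to the GPU.
        gc.collect()
        torch.cuda.empty_cache()

torch.cuda.empty_cache() only releases cached blocks whose tensors are already unreferenced, which is why the gc.collect() pass comes first.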
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.5194600Z 2025-05-07T20:33:12.5194723Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.5194929Z 2025-05-07T20:33:12.5195034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.5195437Z self=, 2025-05-07T20:33:12.5195832Z T=128, 2025-05-07T20:33:12.5196020Z D=5120, 2025-05-07T20:33:12.5196206Z scale_ub=1200.0, 2025-05-07T20:33:12.5196430Z contiguous=False, 2025-05-07T20:33:12.5196655Z compiled=False, 2025-05-07T20:33:12.5196855Z ) 2025-05-07T20:33:12.6480546Z self = 2025-05-07T20:33:12.6481788Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.6482299Z 2025-05-07T20:33:12.6482441Z @given( 2025-05-07T20:33:12.6482858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6483435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6483990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6484606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6485214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6485740Z ) 2025-05-07T20:33:12.6486380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6487186Z def test_silu_mul_quant( 2025-05-07T20:33:12.6487633Z self, 2025-05-07T20:33:12.6487992Z T: int, 2025-05-07T20:33:12.6488342Z D: int, 2025-05-07T20:33:12.6488757Z scale_ub: Optional[float], 2025-05-07T20:33:12.6489268Z contiguous: bool, 2025-05-07T20:33:12.6489697Z compiled: bool, 2025-05-07T20:33:12.6490115Z ) -> None: 2025-05-07T20:33:12.6490494Z torch.manual_seed(2025) 2025-05-07T20:33:12.6490764Z 2025-05-07T20:33:12.6491035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6491380Z 2025-05-07T20:33:12.6491577Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6491867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6499753Z x = x_sign * x_clamp 2025-05-07T20:33:12.6500065Z x0 = x[:, :D] 2025-05-07T20:33:12.6500299Z x1 = x[:, D:] 2025-05-07T20:33:12.6500518Z 2025-05-07T20:33:12.6500707Z if contiguous: 2025-05-07T20:33:12.6500949Z x0 = x0.contiguous() 2025-05-07T20:33:12.6501218Z x1 = x1.contiguous() 2025-05-07T20:33:12.6501464Z 2025-05-07T20:33:12.6501667Z if scale_ub is not None: 2025-05-07T20:33:12.6501952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.6502410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.6502719Z ) 2025-05-07T20:33:12.6502924Z else: 2025-05-07T20:33:12.6503140Z scale_ub_tensor = None 2025-05-07T20:33:12.6503394Z 2025-05-07T20:33:12.6503638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.6503962Z op = silu_mul_quant 2025-05-07T20:33:12.6504211Z if compiled: 2025-05-07T20:33:12.6504470Z op = torch.compile(op) 2025-05-07T20:33:12.6504760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6505024Z 2025-05-07T20:33:12.6505212Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.6505372Z 2025-05-07T20:33:12.6505471Z moe/activation_test.py:117: 2025-05-07T20:33:12.6505759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6506085Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.6506475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6507163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.6507855Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.6508396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.6509134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.6509808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.6510367Z kernel = self.compile( 2025-05-07T20:33:12.6510909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.6511554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.6511986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6512221Z 2025-05-07T20:33:12.6512427Z self = 2025-05-07T20:33:12.6513493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.6514859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a14107c0>} 2025-05-07T20:33:12.6516183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.6517189Z context = 2025-05-07T20:33:12.6517483Z 2025-05-07T20:33:12.6517652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.6518167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.6518636Z module_map=module_map) 2025-05-07T20:33:12.6519002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.6519358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.6519617Z E ^ 2025-05-07T20:33:12.6520074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.6520573Z 2025-05-07T20:33:12.6520984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.6521506Z 2025-05-07T20:33:12.6521610Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6522031Z self=, 2025-05-07T20:33:12.6522472Z T=2048, 2025-05-07T20:33:12.6522669Z D=7168, 2025-05-07T20:33:12.6522865Z scale_ub=None, 2025-05-07T20:33:12.6523072Z contiguous=False, 2025-05-07T20:33:12.6523298Z compiled=False, 2025-05-07T20:33:12.6523503Z ) 2025-05-07T20:33:12.6523813Z self = 2025-05-07T20:33:12.6524303Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:12.6524576Z 2025-05-07T20:33:12.6524662Z @given( 2025-05-07T20:33:12.6524885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6525199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6525505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6525841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6526164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6526445Z ) 2025-05-07T20:33:12.6526837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6527271Z def test_silu_mul_quant( 2025-05-07T20:33:12.6527516Z self, 2025-05-07T20:33:12.6527716Z T: int, 2025-05-07T20:33:12.6527908Z D: int, 2025-05-07T20:33:12.6528133Z scale_ub: Optional[float], 2025-05-07T20:33:12.6528406Z contiguous: bool, 2025-05-07T20:33:12.6528689Z compiled: bool, 2025-05-07T20:33:12.6528915Z ) -> None: 2025-05-07T20:33:12.6529135Z torch.manual_seed(2025) 2025-05-07T20:33:12.6529370Z 2025-05-07T20:33:12.6529640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6531701Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
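The CompilationError above is an architecture gate rather than a bug in the kernel launch: Triton's fp8e4nv type (float8 e4m3) requires SM 8.9+ (Ada/Hopper), while the A10G on a g5 runner reports SM 8.6, so only fp8e4b15 and fp8e5 are lowered there. A hedged sketch of a skip guard (hypothetical helper, not in activation_test.py):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton rejects fp8e4nv conversions below compute capability 8.9
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        pass  # fp8 test bodies would go here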
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6533592Z 2025-05-07T20:33:12.6533712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.6533923Z 2025-05-07T20:33:12.6534039Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6534441Z self=, 2025-05-07T20:33:12.6534830Z T=128, 2025-05-07T20:33:12.6535012Z D=7168, 2025-05-07T20:33:12.6535200Z scale_ub=1200.0, 2025-05-07T20:33:12.6535424Z contiguous=True, 2025-05-07T20:33:12.6535649Z compiled=True, 2025-05-07T20:33:12.6535852Z ) 2025-05-07T20:33:12.6835868Z self = 2025-05-07T20:33:12.6836623Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.6837001Z 2025-05-07T20:33:12.6837109Z @given( 2025-05-07T20:33:12.6837418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6837829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6838133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6838468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6838801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6839079Z ) 2025-05-07T20:33:12.6839429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6839911Z def test_silu_mul_quant( 2025-05-07T20:33:12.6840174Z self, 2025-05-07T20:33:12.6840375Z T: int, 2025-05-07T20:33:12.6840578Z D: int, 2025-05-07T20:33:12.6840797Z scale_ub: Optional[float], 2025-05-07T20:33:12.6841070Z contiguous: bool, 2025-05-07T20:33:12.6841315Z compiled: bool, 2025-05-07T20:33:12.6841661Z ) -> None: 2025-05-07T20:33:12.6841876Z torch.manual_seed(2025) 2025-05-07T20:33:12.6842122Z 2025-05-07T20:33:12.6842400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6842738Z 2025-05-07T20:33:12.6842935Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6843225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6843531Z x = x_sign * x_clamp 2025-05-07T20:33:12.6843778Z x0 = x[:, :D] 2025-05-07T20:33:12.6844004Z x1 = x[:, D:] 2025-05-07T20:33:12.6844206Z 2025-05-07T20:33:12.6844397Z if contiguous: 2025-05-07T20:33:12.6844639Z x0 = x0.contiguous() 2025-05-07T20:33:12.6844894Z x1 = x1.contiguous() 2025-05-07T20:33:12.6845141Z 2025-05-07T20:33:12.6845336Z if scale_ub is not None: 2025-05-07T20:33:12.6845604Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.6846010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.6846328Z ) 2025-05-07T20:33:12.6846527Z else: 2025-05-07T20:33:12.6846736Z scale_ub_tensor = None 2025-05-07T20:33:12.6846990Z 2025-05-07T20:33:12.6847226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.6847542Z op = silu_mul_quant 2025-05-07T20:33:12.6847794Z if compiled: 2025-05-07T20:33:12.6848048Z op = torch.compile(op) 2025-05-07T20:33:12.6848414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6848690Z 2025-05-07T20:33:12.6848888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.6849049Z 2025-05-07T20:33:12.6849148Z moe/activation_test.py:117: 2025-05-07T20:33:12.6849445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6849769Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.6850066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6850730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:12.6851285Z return fn(*args, **kwargs) 
2025-05-07T20:33:12.6851934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.6852611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.6853223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.6853901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.6854553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.6855079Z kernel = self.compile( 2025-05-07T20:33:12.6855612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.6856363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.6856818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6857080Z 2025-05-07T20:33:12.6857320Z self = 2025-05-07T20:33:12.6858626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.6860330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1411940>} 2025-05-07T20:33:12.6861648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.6862666Z context = 2025-05-07T20:33:12.6863019Z 2025-05-07T20:33:12.6863189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.6863696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.6864158Z module_map=module_map) 2025-05-07T20:33:12.6864523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.6864870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.6865125Z E ^ 2025-05-07T20:33:12.6865584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.6866024Z 2025-05-07T20:33:12.6866442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.6866945Z 2025-05-07T20:33:12.6867118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6867538Z self=, 2025-05-07T20:33:12.6867939Z T=128, 2025-05-07T20:33:12.6868121Z D=7168, 2025-05-07T20:33:12.6868316Z scale_ub=1200.0, 2025-05-07T20:33:12.6868544Z contiguous=True, 2025-05-07T20:33:12.6868764Z compiled=False, 2025-05-07T20:33:12.6868973Z ) 2025-05-07T20:33:12.6869292Z self = 2025-05-07T20:33:12.6869870Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.6870165Z 2025-05-07T20:33:12.6870244Z @given( 2025-05-07T20:33:12.6870480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6870788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6871093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6871420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6871753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6872100Z ) 2025-05-07T20:33:12.6872450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6872963Z def test_silu_mul_quant( 2025-05-07T20:33:12.6873215Z self, 2025-05-07T20:33:12.6873421Z T: int, 2025-05-07T20:33:12.6873633Z D: int, 2025-05-07T20:33:12.6873862Z scale_ub: Optional[float], 2025-05-07T20:33:12.6874161Z contiguous: bool, 2025-05-07T20:33:12.6874416Z compiled: bool, 2025-05-07T20:33:12.6874647Z ) -> None: 2025-05-07T20:33:12.6874874Z torch.manual_seed(2025) 2025-05-07T20:33:12.6875137Z 2025-05-07T20:33:12.6875424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6875808Z 2025-05-07T20:33:12.6876008Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6876322Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6878818Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
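By this point even a 20.00 MiB request fails with only 4.44 MiB free, which means memory from earlier examples is still held: Hypothesis keeps failing frames alive, and the caching allocator keeps freed blocks reserved. One mitigation, as a sketch (hypothetical helper, not in the test file), is to release the cache between examples, e.g. from tearDown:

    import gc

    import torch

    def _free_cuda() -> None:
        gc.collect()              # drop unreachable Python references first
        torch.cuda.synchronize()  # let pending kernels finish
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Note this only returns cached blocks; tensors still referenced by live tracebacks stay allocated.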
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6881215Z 2025-05-07T20:33:12.6881345Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:12.6881590Z 2025-05-07T20:33:12.6881698Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6882168Z self=, 2025-05-07T20:33:12.6882621Z T=128, 2025-05-07T20:33:12.6882819Z D=5120, 2025-05-07T20:33:12.6883019Z scale_ub=1200.0, 2025-05-07T20:33:12.6883259Z contiguous=True, 2025-05-07T20:33:12.6883563Z compiled=True, 2025-05-07T20:33:12.6883773Z ) 2025-05-07T20:33:12.6884094Z self = 2025-05-07T20:33:12.6884571Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.6884837Z 2025-05-07T20:33:12.6884915Z @given( 2025-05-07T20:33:12.6885145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6885453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6885753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6886079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6886402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6886689Z ) 2025-05-07T20:33:12.6887039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6887478Z def test_silu_mul_quant( 2025-05-07T20:33:12.6887717Z self, 2025-05-07T20:33:12.6887964Z T: int, 2025-05-07T20:33:12.6888171Z D: int, 2025-05-07T20:33:12.6888386Z scale_ub: Optional[float], 2025-05-07T20:33:12.6888661Z contiguous: bool, 2025-05-07T20:33:12.6888902Z compiled: bool, 2025-05-07T20:33:12.6889118Z ) -> None: 2025-05-07T20:33:12.6889346Z torch.manual_seed(2025) 2025-05-07T20:33:12.6889589Z 2025-05-07T20:33:12.6889854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6890233Z 2025-05-07T20:33:12.6890426Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6890713Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6892706Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6894575Z 2025-05-07T20:33:12.6894693Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:12.6894906Z 2025-05-07T20:33:12.6895008Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6895418Z self=, 2025-05-07T20:33:12.6895810Z T=128, 2025-05-07T20:33:12.6896002Z D=7168, 2025-05-07T20:33:12.6896191Z scale_ub=None, 2025-05-07T20:33:12.6896400Z contiguous=True, 2025-05-07T20:33:12.6896620Z compiled=True, 2025-05-07T20:33:12.6896819Z ) 2025-05-07T20:33:12.9379303Z self = 2025-05-07T20:33:12.9379959Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9380321Z 2025-05-07T20:33:12.9380420Z @given( 2025-05-07T20:33:12.9380660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9380974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9381281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9381616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9381946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9382232Z ) 2025-05-07T20:33:12.9382582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9383020Z def test_silu_mul_quant( 2025-05-07T20:33:12.9383261Z self, 2025-05-07T20:33:12.9383463Z T: int, 2025-05-07T20:33:12.9383668Z D: int, 2025-05-07T20:33:12.9383891Z scale_ub: Optional[float], 2025-05-07T20:33:12.9384164Z contiguous: bool, 2025-05-07T20:33:12.9384409Z compiled: bool, 2025-05-07T20:33:12.9384632Z ) -> None: 2025-05-07T20:33:12.9384859Z torch.manual_seed(2025) 2025-05-07T20:33:12.9385224Z 2025-05-07T20:33:12.9385501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9387510Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
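For orientation on the reference path that fails further below (ref_fn calling triton_quantize_fp8_row, which trips the same fp8e4nv error): row-wise fp8 quantization scales each row so its max fits the fp8 range, then casts. A rough pure-PyTorch sketch of that idea, assuming e4m3 (max 448.0) and ignoring the scale_ub clamp; the actual FBGEMM kernel lives in triton_gemm/fp8_gemm.py and may differ in detail:

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does (y_fp8.to(torch.float32) * y_scale[:, None]) then recovers y up to fp8 rounding.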
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9389333Z 2025-05-07T20:33:12.9389454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.9389668Z 2025-05-07T20:33:12.9405485Z FAILED 2025-05-07T20:33:12.9405754Z 2025-05-07T20:33:12.9406137Z =================================== FAILURES =================================== 2025-05-07T20:33:12.9406779Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:12.9407376Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:12.9408203Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:12.9408940Z | yield 2025-05-07T20:33:12.9409625Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:12.9410339Z | self._callTestMethod(testMethod) 2025-05-07T20:33:12.9411115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:12.9411866Z | if method() is not None: 2025-05-07T20:33:12.9412198Z | ^^^^^^^^ 2025-05-07T20:33:12.9413242Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:12.9414234Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9414638Z | ^^^^^^^ 2025-05-07T20:33:12.9415386Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:12.9416436Z | raise the_error_hypothesis_found 2025-05-07T20:33:12.9417016Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:12.9417601Z +-+---------------- 1 ---------------- 2025-05-07T20:33:12.9418003Z | Traceback (most recent call last): 2025-05-07T20:33:12.9418965Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9420010Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9420522Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9423228Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
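As the traceback above shows, Hypothesis reports the distinct failures as a single ExceptionGroup ("Hypothesis found 4 distinct failures"). On Python 3.11+ such a group can be split by type when triaging; a runnable sketch where run_suite is a hypothetical stand-in for whatever invokes the test:

    import torch

    def run_suite() -> None:
        # Stand-in that raises a group shaped like the one above.
        raise ExceptionGroup(
            "Hypothesis found 4 distinct failures",
            [torch.OutOfMemoryError("CUDA out of memory"),
             RuntimeError("CompilationError stand-in")],
        )

    try:
        run_suite()
    except ExceptionGroup as eg:
        oom, rest = eg.split(torch.OutOfMemoryError)
        # `oom` groups the CUDA OOM failures, `rest` everything else
        print(len(oom.exceptions), len(rest.exceptions))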
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9425906Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9426502Z | self=, 2025-05-07T20:33:12.9427063Z | T=2048, 2025-05-07T20:33:12.9427378Z | D=5120, # or any other generated value 2025-05-07T20:33:12.9427845Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:12.9428513Z | contiguous=True, # or any other generated value 2025-05-07T20:33:12.9429007Z | compiled=False, # or any other generated value 2025-05-07T20:33:12.9429408Z | ) 2025-05-07T20:33:12.9429693Z | 2025-05-07T20:33:12.9430398Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:12.9431217Z +---------------- 2 ---------------- 2025-05-07T20:33:12.9431608Z | Traceback (most recent call last): 2025-05-07T20:33:12.9432605Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9433652Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9434142Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9436904Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9439604Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9440241Z | self=, 2025-05-07T20:33:12.9440827Z | T=128, 2025-05-07T20:33:12.9441095Z | D=7168, 2025-05-07T20:33:12.9441375Z | scale_ub=None, 2025-05-07T20:33:12.9441703Z | contiguous=True, 2025-05-07T20:33:12.9442028Z | compiled=True, 2025-05-07T20:33:12.9442342Z | ) 2025-05-07T20:33:12.9442646Z | 2025-05-07T20:33:12.9443353Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9444164Z +---------------- 3 ---------------- 2025-05-07T20:33:12.9444558Z | Traceback (most recent call last): 2025-05-07T20:33:12.9445497Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9446554Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9447087Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9449412Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
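Each sub-failure comes with a @reproduce_failure blob, as printed above. A sketch of pinning the first one down locally, as a temporary decorator on the real test (Hypothesis refuses to run it if the version does not match, and it should be removed after debugging):

    from hypothesis import given, reproduce_failure, settings, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged body from moe/activation_test.py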
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9451358Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9451791Z | self=, 2025-05-07T20:33:12.9452198Z | T=128, 2025-05-07T20:33:12.9452398Z | D=5120, 2025-05-07T20:33:12.9470653Z | scale_ub=1200.0, 2025-05-07T20:33:12.9471003Z | contiguous=True, 2025-05-07T20:33:12.9471332Z | compiled=True, 2025-05-07T20:33:12.9471635Z | ) 2025-05-07T20:33:12.9471882Z | 2025-05-07T20:33:12.9472627Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9473686Z +---------------- 4 ---------------- 2025-05-07T20:33:12.9474088Z | Traceback (most recent call last): 2025-05-07T20:33:12.9475064Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:12.9476011Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9476388Z | ^^^^^^^^ 2025-05-07T20:33:12.9477247Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:12.9478204Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9478656Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9479868Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:12.9480957Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9481785Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:12.9482772Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9483365Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9484357Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:12.9485405Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9486042Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9487002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:12.9487963Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9488469Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9489280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:12.9490063Z | fn() 2025-05-07T20:33:12.9490845Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:12.9491689Z | self.fn.run( 2025-05-07T20:33:12.9492401Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:12.9493287Z | kernel = self.compile( 2025-05-07T20:33:12.9493651Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:12.9494451Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:12.9495409Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9495931Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9496781Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:12.9497860Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9498504Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9499027Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9499503Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9499869Z | ^ 2025-05-07T20:33:12.9500554Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9501377Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9501911Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:12.9502611Z | self=, 2025-05-07T20:33:12.9503199Z | T=1, # or any other generated value 2025-05-07T20:33:12.9503633Z | D=5120, # or any other generated value 2025-05-07T20:33:12.9504112Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:12.9504605Z | contiguous=True, # or any other generated value 2025-05-07T20:33:12.9505096Z | compiled=True, # or any other generated value 2025-05-07T20:33:12.9505508Z | ) 2025-05-07T20:33:12.9505759Z | 2025-05-07T20:33:12.9506489Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9507372Z +------------------------------------ 2025-05-07T20:33:12.9507877Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:12.9508397Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9508963Z self=, 2025-05-07T20:33:12.9509486Z T=1, 2025-05-07T20:33:12.9509732Z D=5120, 2025-05-07T20:33:12.9509994Z scale_ub=None, 2025-05-07T20:33:12.9510377Z contiguous=True, 2025-05-07T20:33:12.9510671Z compiled=True, 2025-05-07T20:33:12.9510939Z ) 2025-05-07T20:33:12.9511349Z self = 2025-05-07T20:33:12.9511965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9512292Z 2025-05-07T20:33:12.9512407Z @given( 2025-05-07T20:33:12.9512696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9513098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9513556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9513978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9514411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9514792Z ) 2025-05-07T20:33:12.9515242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9515823Z def test_silu_mul_quant( 2025-05-07T20:33:12.9516139Z self, 2025-05-07T20:33:12.9516387Z T: int, 2025-05-07T20:33:12.9516635Z D: int, 2025-05-07T20:33:12.9516918Z scale_ub: Optional[float], 2025-05-07T20:33:12.9517276Z contiguous: bool, 2025-05-07T20:33:12.9517576Z compiled: bool, 2025-05-07T20:33:12.9517866Z ) -> None: 2025-05-07T20:33:12.9518142Z torch.manual_seed(2025) 2025-05-07T20:33:12.9518450Z 2025-05-07T20:33:12.9518799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9519247Z 2025-05-07T20:33:12.9519512Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9519896Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9520327Z x = x_sign * x_clamp 2025-05-07T20:33:12.9520667Z x0 = x[:, :D] 2025-05-07T20:33:12.9520964Z x1 = x[:, D:] 2025-05-07T20:33:12.9521260Z 2025-05-07T20:33:12.9521514Z if contiguous: 2025-05-07T20:33:12.9521816Z x0 = x0.contiguous() 2025-05-07T20:33:12.9522143Z x1 = x1.contiguous() 2025-05-07T20:33:12.9522449Z 2025-05-07T20:33:12.9522703Z if scale_ub is not None: 2025-05-07T20:33:12.9523073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9523521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9523925Z ) 2025-05-07T20:33:12.9524183Z else: 2025-05-07T20:33:12.9524465Z scale_ub_tensor = None 2025-05-07T20:33:12.9524809Z 2025-05-07T20:33:12.9525123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9525605Z op = silu_mul_quant 2025-05-07T20:33:12.9525926Z if compiled: 2025-05-07T20:33:12.9526248Z op = torch.compile(op) 2025-05-07T20:33:12.9526629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9526982Z 2025-05-07T20:33:12.9527230Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9527604Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9527979Z 2025-05-07T20:33:12.9528293Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9528727Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9529108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9529511Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9529980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9530384Z 2025-05-07T20:33:12.9530644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9530905Z 2025-05-07T20:33:12.9531096Z moe/activation_test.py:126: 2025-05-07T20:33:12.9531480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9531906Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9532332Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9533457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9534521Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9535220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9536117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9537061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9538154Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9539160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9540015Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9540822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9541515Z fn() 2025-05-07T20:33:12.9542224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9543027Z self.fn.run( 2025-05-07T20:33:12.9543689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9544367Z kernel = self.compile( 2025-05-07T20:33:12.9545056Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9545905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9546407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9546707Z 2025-05-07T20:33:12.9546966Z self = 2025-05-07T20:33:12.9548343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9550212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057db01c60>} 2025-05-07T20:33:12.9552008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9553489Z context = 2025-05-07T20:33:12.9553869Z 2025-05-07T20:33:12.9554079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9554748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9555351Z module_map=module_map) 2025-05-07T20:33:12.9555816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9556272Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9556615Z E ^ 2025-05-07T20:33:12.9557223Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9557838Z 2025-05-07T20:33:12.9558392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9559076Z 2025-05-07T20:33:12.9559635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9560207Z self=, 2025-05-07T20:33:12.9560783Z T=2048, 2025-05-07T20:33:12.9561028Z D=5120, 2025-05-07T20:33:12.9561282Z scale_ub=1200.0, 2025-05-07T20:33:12.9561592Z contiguous=True, 2025-05-07T20:33:12.9561894Z compiled=False, 2025-05-07T20:33:12.9562168Z ) 2025-05-07T20:33:12.9562705Z self = 2025-05-07T20:33:12.9563388Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.9563757Z 2025-05-07T20:33:12.9563866Z @given( 2025-05-07T20:33:12.9564175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9564605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9565030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9565479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9566024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9566416Z ) 2025-05-07T20:33:12.9566881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9567477Z def test_silu_mul_quant( 2025-05-07T20:33:12.9567808Z self, 2025-05-07T20:33:12.9568072Z T: int, 2025-05-07T20:33:12.9568347Z D: int, 2025-05-07T20:33:12.9568654Z scale_ub: Optional[float], 2025-05-07T20:33:12.9569000Z contiguous: bool, 2025-05-07T20:33:12.9569305Z compiled: bool, 2025-05-07T20:33:12.9569591Z ) -> None: 2025-05-07T20:33:12.9569864Z torch.manual_seed(2025) 2025-05-07T20:33:12.9570234Z 2025-05-07T20:33:12.9570588Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9571025Z 2025-05-07T20:33:12.9571278Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9571659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9572072Z x = x_sign * x_clamp 2025-05-07T20:33:12.9572388Z x0 = x[:, :D] 
2025-05-07T20:33:12.9572677Z x1 = x[:, D:] 2025-05-07T20:33:12.9573092Z 2025-05-07T20:33:12.9573350Z if contiguous: 2025-05-07T20:33:12.9573658Z x0 = x0.contiguous() 2025-05-07T20:33:12.9573999Z x1 = x1.contiguous() 2025-05-07T20:33:12.9574313Z 2025-05-07T20:33:12.9574567Z if scale_ub is not None: 2025-05-07T20:33:12.9574944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9575389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9575809Z ) 2025-05-07T20:33:12.9576075Z else: 2025-05-07T20:33:12.9576359Z scale_ub_tensor = None 2025-05-07T20:33:12.9576703Z 2025-05-07T20:33:12.9577016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9577436Z op = silu_mul_quant 2025-05-07T20:33:12.9577775Z if compiled: 2025-05-07T20:33:12.9578119Z op = torch.compile(op) 2025-05-07T20:33:12.9578591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9578952Z 2025-05-07T20:33:12.9579207Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9579425Z 2025-05-07T20:33:12.9579564Z moe/activation_test.py:117: 2025-05-07T20:33:12.9579947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9580384Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9580768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9581712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9582673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9583414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9584335Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9585297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9586024Z kernel = self.compile( 2025-05-07T20:33:12.9586763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9587626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9588225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9588529Z 2025-05-07T20:33:12.9588802Z self = 2025-05-07T20:33:12.9590314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9592226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d958220>} 2025-05-07T20:33:12.9594062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9595434Z context = 2025-05-07T20:33:12.9595817Z 2025-05-07T20:33:12.9596047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9596764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9597374Z module_map=module_map) 2025-05-07T20:33:12.9597864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9598347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9598701Z E ^ 2025-05-07T20:33:12.9599367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9600008Z 2025-05-07T20:33:12.9600633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9601329Z 2025-05-07T20:33:12.9601478Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9602016Z self=, 2025-05-07T20:33:12.9602562Z T=2048, 2025-05-07T20:33:12.9602815Z D=5120, 2025-05-07T20:33:12.9603075Z scale_ub=1200.0, 2025-05-07T20:33:12.9603366Z contiguous=True, 2025-05-07T20:33:12.9603670Z compiled=True, 2025-05-07T20:33:12.9603945Z ) 2025-05-07T20:33:12.9604367Z self = 2025-05-07T20:33:12.9605003Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.9605349Z 2025-05-07T20:33:12.9605475Z @given( 2025-05-07T20:33:12.9605852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9606290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9606704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9607138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9607578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9607967Z ) 2025-05-07T20:33:12.9608436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9609035Z def test_silu_mul_quant( 2025-05-07T20:33:12.9609367Z self, 2025-05-07T20:33:12.9609620Z T: int, 2025-05-07T20:33:12.9609888Z D: int, 2025-05-07T20:33:12.9610021Z scale_ub: Optional[float], 2025-05-07T20:33:12.9610153Z contiguous: bool, 2025-05-07T20:33:12.9610269Z compiled: bool, 2025-05-07T20:33:12.9610383Z ) -> None: 2025-05-07T20:33:12.9610577Z torch.manual_seed(2025) 2025-05-07T20:33:12.9610687Z 2025-05-07T20:33:12.9610930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9611034Z 2025-05-07T20:33:12.9611164Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9611348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9611473Z x = x_sign * x_clamp 2025-05-07T20:33:12.9611585Z x0 = x[:, :D] 2025-05-07T20:33:12.9611758Z x1 = x[:, D:] 2025-05-07T20:33:12.9611861Z 2025-05-07T20:33:12.9611981Z if contiguous: 2025-05-07T20:33:12.9612115Z x0 = x0.contiguous() 2025-05-07T20:33:12.9612240Z x1 = x1.contiguous() 2025-05-07T20:33:12.9612348Z 2025-05-07T20:33:12.9612476Z if scale_ub is not None: 2025-05-07T20:33:12.9612626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9612825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9612933Z ) 2025-05-07T20:33:12.9613180Z else: 2025-05-07T20:33:12.9613372Z scale_ub_tensor = None 2025-05-07T20:33:12.9613448Z 2025-05-07T20:33:12.9613582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9613678Z op = silu_mul_quant 2025-05-07T20:33:12.9613762Z if compiled: 2025-05-07T20:33:12.9613862Z op = torch.compile(op) 2025-05-07T20:33:12.9613972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9614049Z 2025-05-07T20:33:12.9614151Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9614276Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9614347Z 2025-05-07T20:33:12.9614492Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9614594Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9614694Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9614821Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9614964Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9615042Z 2025-05-07T20:33:12.9615148Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9615153Z 2025-05-07T20:33:12.9615251Z moe/activation_test.py:126: 2025-05-07T20:33:12.9615389Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9615496Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9615633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9616192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9616293Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9616647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9616874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9617287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9617545Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9617917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9618086Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9618431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9618509Z fn() 2025-05-07T20:33:12.9618907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9618991Z self.fn.run( 2025-05-07T20:33:12.9619325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9619470Z kernel = self.compile( 2025-05-07T20:33:12.9619850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9620033Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9620187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9620193Z 2025-05-07T20:33:12.9620485Z self = 2025-05-07T20:33:12.9621261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9621755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d9596c0>} 2025-05-07T20:33:12.9622591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9622796Z context = 2025-05-07T20:33:12.9622801Z 2025-05-07T20:33:12.9622967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9623237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9623348Z module_map=module_map) 2025-05-07T20:33:12.9623513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9623623Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9623702Z E ^ 2025-05-07T20:33:12.9624058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9624063Z 2025-05-07T20:33:12.9624477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9624481Z 2025-05-07T20:33:12.9624584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9624812Z self=, 2025-05-07T20:33:12.9624891Z T=16384, 2025-05-07T20:33:12.9624968Z D=7168, 2025-05-07T20:33:12.9625065Z scale_ub=1200.0, 2025-05-07T20:33:12.9625152Z contiguous=False, 2025-05-07T20:33:12.9625246Z compiled=False, 2025-05-07T20:33:12.9625323Z ) 2025-05-07T20:33:12.9625538Z self = 2025-05-07T20:33:12.9625726Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.9625731Z 2025-05-07T20:33:12.9625810Z @given( 2025-05-07T20:33:12.9625930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9626040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9626205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9626322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9626446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9626523Z ) 2025-05-07T20:33:12.9626777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9626875Z def test_silu_mul_quant( 2025-05-07T20:33:12.9626955Z self, 2025-05-07T20:33:12.9627041Z T: int, 2025-05-07T20:33:12.9627120Z D: int, 2025-05-07T20:33:12.9627221Z scale_ub: Optional[float], 2025-05-07T20:33:12.9627319Z contiguous: bool, 2025-05-07T20:33:12.9627407Z compiled: bool, 2025-05-07T20:33:12.9627488Z ) -> None: 2025-05-07T20:33:12.9627588Z torch.manual_seed(2025) 2025-05-07T20:33:12.9627665Z 2025-05-07T20:33:12.9627833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9627965Z 2025-05-07T20:33:12.9628061Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9628195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9628286Z x = x_sign * x_clamp 2025-05-07T20:33:12.9628368Z x0 = x[:, :D] 2025-05-07T20:33:12.9628457Z x1 = x[:, D:] 2025-05-07T20:33:12.9628532Z 2025-05-07T20:33:12.9628616Z if contiguous: 2025-05-07T20:33:12.9628756Z x0 = x0.contiguous() 2025-05-07T20:33:12.9628845Z x1 = x1.contiguous() 2025-05-07T20:33:12.9628916Z 2025-05-07T20:33:12.9629013Z if scale_ub is not None: 2025-05-07T20:33:12.9629118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9629252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9629339Z ) 2025-05-07T20:33:12.9629417Z else: 2025-05-07T20:33:12.9629522Z scale_ub_tensor = None 2025-05-07T20:33:12.9629596Z 2025-05-07T20:33:12.9629768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9629871Z op = silu_mul_quant 2025-05-07T20:33:12.9629958Z if compiled: 2025-05-07T20:33:12.9630058Z op = torch.compile(op) 2025-05-07T20:33:12.9630171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9630248Z 2025-05-07T20:33:12.9630341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9630345Z 2025-05-07T20:33:12.9630453Z moe/activation_test.py:117: 2025-05-07T20:33:12.9630584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9630690Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9630790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9631284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:12.9631394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9631755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9631978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9632319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9632419Z kernel = self.compile( 2025-05-07T20:33:12.9632804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9632983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9633110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9633115Z 2025-05-07T20:33:12.9633327Z self = 2025-05-07T20:33:12.9634098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9634645Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c824720>} 2025-05-07T20:33:12.9635377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9635568Z context = 2025-05-07T20:33:12.9635580Z 2025-05-07T20:33:12.9635744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9636000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9636118Z module_map=module_map) 2025-05-07T20:33:12.9636324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9636429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9636517Z E ^ 2025-05-07T20:33:12.9636868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test source identical to the listing above; with compiled=True, fn() succeeds and the failure moves to the reference path)
moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:124: in ref_fn: return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row: _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
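Every example above fails at the same point for the same reason: both Triton kernels (_fbgemm_silu_mul_quant and _kernel_quantize_fp8_row) cast to the fp8e4nv (E4M3) element type, and the GPU serving this job only exposes fp8e4b15 and fp8e5, as the ValueError states. Triton generally enables fp8e4nv only on compute capability 8.9 (Ada) or newer, while the A10G on a g5 runner is SM 8.6. A capability guard would skip these cases cleanly instead of erroring inside the compiler; the sketch below is illustrative only, and the helper and class names are hypothetical, not the actual test file:

import unittest

import torch

def _supports_fp8_e4m3() -> bool:
    # fp8e4nv (E4M3) needs SM 8.9 (Ada) or SM 9.0 (Hopper) hardware;
    # the A10G running this job is SM 8.6, so this returns False here.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

@unittest.skipIf(
    not torch.cuda.is_available() or not _supports_fp8_e4m3(),
    "fp8e4nv requires a GPU with compute capability >= 8.9",
)
class ActivationTests(unittest.TestCase):  # hypothetical class name
    ...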
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn()
moe/activation_test.py:115: in fn: return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant: _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)
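One pattern in the examples above: with compiled=False, fn() fails immediately because silu_mul_quant launches _fbgemm_silu_mul_quant eagerly; with compiled=True, fn() apparently survives (torch.compile handles the op differently at this point) and the identical error then surfaces in the reference path via _kernel_quantize_fp8_row. Either way, no combination in the test matrix can pass on this machine. The failure also reproduces without FBGEMM at all; a minimal sketch, assuming the same Triton build and a pre-SM-8.9 GPU (the kernel and its names are illustrative):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr):
    # A cast to tl.float8e4nv is enough: on a pre-SM-8.9 GPU, building the
    # kernel IR rejects it with ValueError("type fp8e4nv not supported in
    # this architecture. ...").
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.randn(1, device="cuda")
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv[(1,)](x, y)  # raises triton.compiler.errors.CompilationError here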
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)
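The ValueError also names the only fp8 types this architecture does support: fp8e4b15 and fp8e5. If these kernels were meant to run, rather than be skipped, on pre-Ada GPUs, the quantization dtype would have to be chosen per device, trading E4M3's extra mantissa bit for E5M2's hardware availability. A hypothetical selection helper as a sketch; note that neither silu_mul_quant nor triton_quantize_fp8_row is shown anywhere in this log to accept a dtype argument:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # E4M3 (Triton's fp8e4nv) only exists in hardware from SM 8.9 onward;
    # E5M2 (Triton's fp8e5) is the wider-range, lower-precision fallback.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2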
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9779747Z 2025-05-07T20:33:12.9780284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9780292Z 2025-05-07T20:33:12.9780399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9780655Z self=, 2025-05-07T20:33:12.9780732Z T=4096, 2025-05-07T20:33:12.9780808Z D=5120, 2025-05-07T20:33:12.9780897Z scale_ub=None, 2025-05-07T20:33:12.9780982Z contiguous=True, 2025-05-07T20:33:12.9781070Z compiled=True, 2025-05-07T20:33:12.9781147Z ) 2025-05-07T20:33:12.9781395Z self = 2025-05-07T20:33:12.9781579Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9781583Z 2025-05-07T20:33:12.9781663Z @given( 2025-05-07T20:33:12.9781789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9781892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9782022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9782151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9782281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9794986Z ) 2025-05-07T20:33:12.9795260Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9795354Z def test_silu_mul_quant( 2025-05-07T20:33:12.9795435Z self, 2025-05-07T20:33:12.9795518Z T: int, 2025-05-07T20:33:12.9795595Z D: int, 2025-05-07T20:33:12.9795696Z scale_ub: Optional[float], 2025-05-07T20:33:12.9795784Z contiguous: bool, 2025-05-07T20:33:12.9795866Z compiled: bool, 2025-05-07T20:33:12.9795952Z ) -> None: 2025-05-07T20:33:12.9796045Z torch.manual_seed(2025) 2025-05-07T20:33:12.9796117Z 2025-05-07T20:33:12.9796292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9796363Z 2025-05-07T20:33:12.9796461Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9796739Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9796827Z x = x_sign * x_clamp 2025-05-07T20:33:12.9796911Z x0 = x[:, :D] 2025-05-07T20:33:12.9796987Z x1 = x[:, D:] 2025-05-07T20:33:12.9797060Z 2025-05-07T20:33:12.9797148Z if contiguous: 2025-05-07T20:33:12.9797237Z x0 = x0.contiguous() 2025-05-07T20:33:12.9797327Z x1 = x1.contiguous() 2025-05-07T20:33:12.9797403Z 2025-05-07T20:33:12.9797493Z if scale_ub is not None: 2025-05-07T20:33:12.9797596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9797732Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9797802Z ) 2025-05-07T20:33:12.9797881Z else: 2025-05-07T20:33:12.9797978Z scale_ub_tensor = None 2025-05-07T20:33:12.9798046Z 2025-05-07T20:33:12.9798180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9798313Z op = silu_mul_quant 2025-05-07T20:33:12.9798401Z if compiled: 2025-05-07T20:33:12.9798514Z op = torch.compile(op) 2025-05-07T20:33:12.9798619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9798691Z 2025-05-07T20:33:12.9798789Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9798911Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9798980Z 2025-05-07T20:33:12.9799165Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9799268Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9799375Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9799497Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9799631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9799707Z 2025-05-07T20:33:12.9799804Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:12.9799809Z 2025-05-07T20:33:12.9799917Z moe/activation_test.py:126: 2025-05-07T20:33:12.9800130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9800244Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9800376Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9800943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9801045Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9801399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9801622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9801984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9802241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9802611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9802781Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9803112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9803188Z fn() 2025-05-07T20:33:12.9803591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9803672Z self.fn.run( 2025-05-07T20:33:12.9804004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9804101Z kernel = self.compile( 2025-05-07T20:33:12.9804475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9804653Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9804827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9804832Z 2025-05-07T20:33:12.9805035Z self = 2025-05-07T20:33:12.9805803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9806296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055335b4c0>} 2025-05-07T20:33:12.9807031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9807261Z context = 2025-05-07T20:33:12.9807268Z 2025-05-07T20:33:12.9807441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9807697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9807802Z module_map=module_map) 2025-05-07T20:33:12.9807971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9808119Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9808196Z E ^ 2025-05-07T20:33:12.9808553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9808558Z 2025-05-07T20:33:12.9808961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9808966Z 2025-05-07T20:33:12.9809070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9809335Z self=, 2025-05-07T20:33:12.9809415Z T=16384, 2025-05-07T20:33:12.9809499Z D=5120, 2025-05-07T20:33:12.9809583Z scale_ub=None, 2025-05-07T20:33:12.9809668Z contiguous=True, 2025-05-07T20:33:12.9809753Z compiled=True, 2025-05-07T20:33:12.9809828Z ) 2025-05-07T20:33:12.9810047Z self = 2025-05-07T20:33:12.9810232Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9810238Z 2025-05-07T20:33:12.9810333Z @given( 2025-05-07T20:33:12.9810477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9810589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9810703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9810822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9810939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9811021Z ) 2025-05-07T20:33:12.9811269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9811363Z def test_silu_mul_quant( 2025-05-07T20:33:12.9811435Z self, 2025-05-07T20:33:12.9811519Z T: int, 2025-05-07T20:33:12.9811594Z D: int, 2025-05-07T20:33:12.9811700Z scale_ub: Optional[float], 2025-05-07T20:33:12.9811795Z contiguous: bool, 2025-05-07T20:33:12.9811878Z compiled: bool, 2025-05-07T20:33:12.9811964Z ) -> None: 2025-05-07T20:33:12.9812057Z torch.manual_seed(2025) 2025-05-07T20:33:12.9812133Z 2025-05-07T20:33:12.9812312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9812385Z 2025-05-07T20:33:12.9812475Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9812607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9812698Z x = x_sign * x_clamp 2025-05-07T20:33:12.9812780Z x0 = x[:, :D] 2025-05-07T20:33:12.9812912Z x1 = x[:, D:] 2025-05-07T20:33:12.9813058Z 2025-05-07T20:33:12.9813143Z if contiguous: 2025-05-07T20:33:12.9813243Z x0 = x0.contiguous() 2025-05-07T20:33:12.9813332Z x1 = x1.contiguous() 2025-05-07T20:33:12.9813412Z 2025-05-07T20:33:12.9813503Z if scale_ub is not None: 2025-05-07T20:33:12.9813609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9813746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9813817Z ) 2025-05-07T20:33:12.9813889Z else: 2025-05-07T20:33:12.9813989Z scale_ub_tensor = None 2025-05-07T20:33:12.9814060Z 2025-05-07T20:33:12.9814186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9814282Z op = silu_mul_quant 2025-05-07T20:33:12.9814367Z if compiled: 2025-05-07T20:33:12.9814466Z op = torch.compile(op) 2025-05-07T20:33:12.9814625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9814701Z 2025-05-07T20:33:12.9814798Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9814919Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9814988Z 2025-05-07T20:33:12.9815126Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9815227Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9815364Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9815491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9815628Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9815702Z 2025-05-07T20:33:12.9815805Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9815810Z 2025-05-07T20:33:12.9815907Z moe/activation_test.py:126: 2025-05-07T20:33:12.9816041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9816147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9816322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9816880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9816982Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9817335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9817564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9817927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9818189Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9818562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9818740Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9819077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9819155Z fn() 2025-05-07T20:33:12.9819556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9819640Z self.fn.run( 2025-05-07T20:33:12.9819971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9820069Z kernel = self.compile( 2025-05-07T20:33:12.9820466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9820667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9820799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9820806Z 2025-05-07T20:33:12.9821054Z self = 2025-05-07T20:33:12.9821822Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9822315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05528c9580>} 2025-05-07T20:33:12.9823052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9823237Z context = 2025-05-07T20:33:12.9823241Z 2025-05-07T20:33:12.9823444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9823707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9823812Z module_map=module_map) 2025-05-07T20:33:12.9823970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9824081Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9824158Z E ^ 2025-05-07T20:33:12.9824556Z E ValueError("type fp8e4nv not supported in this architecture. 
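Note on the failure mode: every example above dies in the same place, Triton's make_ir, because fp8e4nv (Triton's name for PyTorch's float8_e4m3fn) is only compilable on NVIDIA GPUs with compute capability 8.9 or newer; pre-Ada parts such as SM 8.0/8.6 GPUs expose only fp8e5 and fp8e4b15, which is exactly what the ValueError lists. A minimal sketch of a capability gate follows; the helper name supports_fp8e4nv, the guarded test name, and the (8, 9) threshold are illustrative assumptions, not FBGEMM's actual guard.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv == float8_e4m3fn. Hardware-backed conversions start at
    # compute capability (8, 9) (Ada) and (9, 0) (Hopper); SM 8.6 and
    # older only get fp8e5 / fp8e4b15, matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 unsupported on this GPU")
def test_silu_mul_quant_guarded() -> None:
    ...  # body as in the listing above

With a guard of this shape the runner would report one skip instead of tripping the same CompilationError once per Hypothesis example.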
2025-05-07T20:33:12.9825068Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9825291Z     self=,
2025-05-07T20:33:12.9825366Z     T=1,
2025-05-07T20:33:12.9825453Z     D=5120,
2025-05-07T20:33:12.9825600Z     scale_ub=1200.0,
2025-05-07T20:33:12.9825690Z     contiguous=True,
2025-05-07T20:33:12.9825779Z     compiled=True,
2025-05-07T20:33:12.9825853Z )
2025-05-07T20:33:12.9826069Z self = 
2025-05-07T20:33:12.9826239Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:12.9826247Z 
(Source listing identical to the example above; unlike it, this example fails at the fused op itself rather than in the reference path:)
2025-05-07T20:33:12.9830600Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:12.9830650Z 
2025-05-07T20:33:12.9830749Z moe/activation_test.py:117: 
2025-05-07T20:33:12.9830875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9830981Z moe/activation_test.py:115: in fn
2025-05-07T20:33:12.9831079Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9831440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:12.9831578Z     return fn(*args, **kwargs)
2025-05-07T20:33:12.9832063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:12.9832162Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9832520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9832739Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9833118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9833214Z     kernel = self.compile(
2025-05-07T20:33:12.9833591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9833768Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9833896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9833901Z 
2025-05-07T20:33:12.9834109Z self = 
2025-05-07T20:33:12.9834869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9835364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553854860>}
2025-05-07T20:33:12.9836100Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9836287Z context = 
2025-05-07T20:33:12.9836294Z 
2025-05-07T20:33:12.9836460Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9836714Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9836820Z                            module_map=module_map)
2025-05-07T20:33:12.9836982Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9837077Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9837153Z E       ^
2025-05-07T20:33:12.9837508Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9837555Z 
2025-05-07T20:33:12.9837959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9837964Z 
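For orientation while reading these tracebacks: both failing call sites compute y = silu(x0) * x1 and then quantize y rowwise to FP8, one scale per row. ref_fn does it in two steps (fp32 math, then triton_quantize_fp8_row), while silu_mul_quant fuses everything into one Triton kernel. The sketch below restates that contract in plain PyTorch under stated assumptions: scale = row_max / FP8_MAX, with scale_ub clamping the row max. That convention is consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) but is not guaranteed to match FBGEMM's exact numerics; the name silu_mul_quant_ref is hypothetical.

from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # what Triton calls fp8e4nv
FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in ref_fn above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Rowwise scale: one scalar per row, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(y.dtype))
    y_scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] then recovers y up to FP8 rounding error, on hardware where the cast compiles at all.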
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9837555Z 2025-05-07T20:33:12.9837959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9837964Z 2025-05-07T20:33:12.9838069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9838284Z self=, 2025-05-07T20:33:12.9838364Z T=1, 2025-05-07T20:33:12.9838445Z D=5120, 2025-05-07T20:33:12.9838526Z scale_ub=None, 2025-05-07T20:33:12.9838617Z contiguous=False, 2025-05-07T20:33:12.9838701Z compiled=True, 2025-05-07T20:33:12.9838775Z ) 2025-05-07T20:33:12.9838991Z self = 2025-05-07T20:33:12.9839152Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:12.9839156Z 2025-05-07T20:33:12.9839233Z @given( 2025-05-07T20:33:12.9839404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9839504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9839617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9839738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9839851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9839924Z ) 2025-05-07T20:33:12.9840247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9840348Z def test_silu_mul_quant( 2025-05-07T20:33:12.9840434Z self, 2025-05-07T20:33:12.9840510Z T: int, 2025-05-07T20:33:12.9840585Z D: int, 2025-05-07T20:33:12.9840686Z scale_ub: Optional[float], 2025-05-07T20:33:12.9840773Z contiguous: bool, 2025-05-07T20:33:12.9840854Z compiled: bool, 2025-05-07T20:33:12.9840936Z ) -> None: 2025-05-07T20:33:12.9841028Z torch.manual_seed(2025) 2025-05-07T20:33:12.9841103Z 2025-05-07T20:33:12.9841314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9841389Z 2025-05-07T20:33:12.9841481Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9841612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9841701Z x = x_sign * x_clamp 2025-05-07T20:33:12.9841787Z x0 = x[:, :D] 2025-05-07T20:33:12.9841863Z x1 = x[:, D:] 2025-05-07T20:33:12.9841939Z 2025-05-07T20:33:12.9842029Z if contiguous: 2025-05-07T20:33:12.9842119Z x0 = x0.contiguous() 2025-05-07T20:33:12.9842209Z x1 = x1.contiguous() 2025-05-07T20:33:12.9842285Z 2025-05-07T20:33:12.9842376Z if scale_ub is not None: 2025-05-07T20:33:12.9842480Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9842617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9842692Z ) 2025-05-07T20:33:12.9842768Z else: 2025-05-07T20:33:12.9842875Z scale_ub_tensor = None 2025-05-07T20:33:12.9842946Z 2025-05-07T20:33:12.9843077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9843165Z op = silu_mul_quant 2025-05-07T20:33:12.9843247Z if compiled: 2025-05-07T20:33:12.9843352Z op = torch.compile(op) 2025-05-07T20:33:12.9843456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9843530Z 2025-05-07T20:33:12.9843617Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9843744Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9843811Z 2025-05-07T20:33:12.9843944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9844048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9844145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9844267Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9844404Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9844524Z 2025-05-07T20:33:12.9844630Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:12.9844635Z 2025-05-07T20:33:12.9844734Z moe/activation_test.py:126: 2025-05-07T20:33:12.9844859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9844969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9845099Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9845649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9845749Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9846102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9846324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9846725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9846980Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9847352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9847517Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9847894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9847969Z fn() 2025-05-07T20:33:12.9848360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9848445Z self.fn.run( 2025-05-07T20:33:12.9848776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9848876Z kernel = self.compile( 2025-05-07T20:33:12.9849296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9849472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9849601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9849606Z 2025-05-07T20:33:12.9849807Z self = 2025-05-07T20:33:12.9850599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9851122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553856b60>} 2025-05-07T20:33:12.9851853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9852049Z context = 2025-05-07T20:33:12.9852053Z 2025-05-07T20:33:12.9852216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9852479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9852586Z module_map=module_map) 2025-05-07T20:33:12.9852748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9852852Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9852933Z E ^ 2025-05-07T20:33:12.9853335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9853340Z 2025-05-07T20:33:12.9853751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9853799Z 2025-05-07T20:33:12.9853899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9854121Z self=, 2025-05-07T20:33:12.9854196Z T=1, 2025-05-07T20:33:12.9854270Z D=5120, 2025-05-07T20:33:12.9854354Z scale_ub=None, 2025-05-07T20:33:12.9854437Z contiguous=True, 2025-05-07T20:33:12.9854517Z compiled=False, 2025-05-07T20:33:12.9854590Z ) 2025-05-07T20:33:12.9854805Z self = 2025-05-07T20:33:12.9854964Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:12.9854978Z 2025-05-07T20:33:12.9855050Z @given( 2025-05-07T20:33:12.9855168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9855270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9855429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9855546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9855670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9855747Z ) 2025-05-07T20:33:12.9855989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9856091Z def test_silu_mul_quant( 2025-05-07T20:33:12.9856231Z self, 2025-05-07T20:33:12.9856306Z T: int, 2025-05-07T20:33:12.9856384Z D: int, 2025-05-07T20:33:12.9856479Z scale_ub: Optional[float], 2025-05-07T20:33:12.9856568Z contiguous: bool, 2025-05-07T20:33:12.9856654Z compiled: bool, 2025-05-07T20:33:12.9856732Z ) -> None: 2025-05-07T20:33:12.9856829Z torch.manual_seed(2025) 2025-05-07T20:33:12.9856905Z 2025-05-07T20:33:12.9857072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9857153Z 2025-05-07T20:33:12.9857285Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9857410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9857502Z x = x_sign * x_clamp 2025-05-07T20:33:12.9857581Z x0 = x[:, :D] 2025-05-07T20:33:12.9857659Z x1 = x[:, D:] 2025-05-07T20:33:12.9857731Z 2025-05-07T20:33:12.9857814Z if contiguous: 2025-05-07T20:33:12.9857906Z x0 = x0.contiguous() 2025-05-07T20:33:12.9858002Z x1 = x1.contiguous() 2025-05-07T20:33:12.9858074Z 2025-05-07T20:33:12.9858165Z if scale_ub is not None: 2025-05-07T20:33:12.9858274Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9858406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9858485Z ) 2025-05-07T20:33:12.9858560Z else: 2025-05-07T20:33:12.9858655Z scale_ub_tensor = None 2025-05-07T20:33:12.9858733Z 2025-05-07T20:33:12.9858863Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9858959Z op = silu_mul_quant 2025-05-07T20:33:12.9859047Z if compiled: 2025-05-07T20:33:12.9859145Z op = torch.compile(op) 2025-05-07T20:33:12.9859469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9859581Z 2025-05-07T20:33:12.9859697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9859703Z 2025-05-07T20:33:12.9859803Z moe/activation_test.py:117: 2025-05-07T20:33:12.9859935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9860038Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9860140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9860631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9860726Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9861088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9861435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9861838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9861934Z kernel = self.compile( 2025-05-07T20:33:12.9862385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9862582Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9862717Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9862722Z 2025-05-07T20:33:12.9862952Z self = 2025-05-07T20:33:12.9863976Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9864591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05538579c0>} 2025-05-07T20:33:12.9865505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9865777Z context = 2025-05-07T20:33:12.9865782Z 2025-05-07T20:33:12.9865968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9866268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9866379Z module_map=module_map) 2025-05-07T20:33:12.9866563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9866722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9866804Z E ^ 2025-05-07T20:33:12.9867159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9867163Z 2025-05-07T20:33:12.9867566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9867573Z 2025-05-07T20:33:12.9867676Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9867894Z self=, 2025-05-07T20:33:12.9867969Z T=128, 2025-05-07T20:33:12.9868050Z D=5120, 2025-05-07T20:33:12.9868130Z scale_ub=None, 2025-05-07T20:33:12.9868215Z contiguous=False, 2025-05-07T20:33:12.9868297Z compiled=True, 2025-05-07T20:33:12.9868367Z ) 2025-05-07T20:33:12.9868585Z self = 2025-05-07T20:33:12.9868759Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:12.9868764Z 2025-05-07T20:33:12.9868834Z @given( 2025-05-07T20:33:12.9868955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9869055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9869169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9869288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9869398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9869472Z ) 2025-05-07T20:33:12.9869717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9869809Z def test_silu_mul_quant( 2025-05-07T20:33:12.9869888Z self, 2025-05-07T20:33:12.9869979Z T: int, 2025-05-07T20:33:12.9870056Z D: int, 2025-05-07T20:33:12.9870177Z scale_ub: Optional[float], 2025-05-07T20:33:12.9870267Z contiguous: bool, 2025-05-07T20:33:12.9870400Z compiled: bool, 2025-05-07T20:33:12.9870479Z ) -> None: 2025-05-07T20:33:12.9870574Z torch.manual_seed(2025) 2025-05-07T20:33:12.9870646Z 2025-05-07T20:33:12.9870814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9870887Z 2025-05-07T20:33:12.9870975Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9871103Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9871192Z x = x_sign * x_clamp 2025-05-07T20:33:12.9871274Z x0 = x[:, :D] 2025-05-07T20:33:12.9871354Z x1 = x[:, D:] 2025-05-07T20:33:12.9871424Z 2025-05-07T20:33:12.9871512Z if contiguous: 2025-05-07T20:33:12.9871601Z x0 = x0.contiguous() 2025-05-07T20:33:12.9871690Z x1 = x1.contiguous() 2025-05-07T20:33:12.9871768Z 2025-05-07T20:33:12.9871857Z if scale_ub is not None: 2025-05-07T20:33:12.9871958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9872145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9872224Z ) 2025-05-07T20:33:12.9872300Z else: 2025-05-07T20:33:12.9872396Z scale_ub_tensor = None 2025-05-07T20:33:12.9872466Z 2025-05-07T20:33:12.9872591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9872683Z op = silu_mul_quant 2025-05-07T20:33:12.9872805Z if compiled: 2025-05-07T20:33:12.9872909Z op = torch.compile(op) 2025-05-07T20:33:12.9873010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9873080Z 2025-05-07T20:33:12.9873174Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9873178Z 2025-05-07T20:33:12.9873274Z moe/activation_test.py:117: 2025-05-07T20:33:12.9873398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9873503Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9873600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9874005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:12.9874097Z return fn(*args, **kwargs) 
2025-05-07T20:33:12.9874581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9874679Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9875029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9875248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9875585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9875676Z kernel = self.compile( 2025-05-07T20:33:12.9876056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9876231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9876354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9876359Z 2025-05-07T20:33:12.9876565Z self = 2025-05-07T20:33:12.9877323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9877823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553854ea0>} 2025-05-07T20:33:12.9878561Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9878824Z context = 2025-05-07T20:33:12.9878833Z 2025-05-07T20:33:12.9878997Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9879253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9879362Z module_map=module_map) 2025-05-07T20:33:12.9879525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9879621Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9879701Z E ^ 2025-05-07T20:33:12.9880048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9880053Z 2025-05-07T20:33:12.9880486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9880537Z 2025-05-07T20:33:12.9880660Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9880878Z self=, 2025-05-07T20:33:12.9880962Z T=128, 2025-05-07T20:33:12.9881039Z D=7168, 2025-05-07T20:33:12.9881118Z scale_ub=1200.0, 2025-05-07T20:33:12.9881204Z contiguous=False, 2025-05-07T20:33:12.9881284Z compiled=False, 2025-05-07T20:33:12.9881393Z ) 2025-05-07T20:33:12.9881613Z self = 2025-05-07T20:33:12.9881780Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.9881785Z 2025-05-07T20:33:12.9881863Z @given( 2025-05-07T20:33:12.9881979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9882078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9882195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9882313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9882468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9882546Z ) 2025-05-07T20:33:12.9882787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9882878Z def test_silu_mul_quant( 2025-05-07T20:33:12.9882954Z self, 2025-05-07T20:33:12.9883025Z T: int, 2025-05-07T20:33:12.9883101Z D: int, 2025-05-07T20:33:12.9883199Z scale_ub: Optional[float], 2025-05-07T20:33:12.9883285Z contiguous: bool, 2025-05-07T20:33:12.9883370Z compiled: bool, 2025-05-07T20:33:12.9883444Z ) -> None: 2025-05-07T20:33:12.9883535Z torch.manual_seed(2025) 2025-05-07T20:33:12.9883614Z 2025-05-07T20:33:12.9883778Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9883850Z 2025-05-07T20:33:12.9883944Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9884065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9884159Z x = x_sign * x_clamp 2025-05-07T20:33:12.9884244Z x0 = x[:, :D] 2025-05-07T20:33:12.9884321Z x1 = x[:, D:] 2025-05-07T20:33:12.9884398Z 2025-05-07T20:33:12.9884479Z if contiguous: 2025-05-07T20:33:12.9884566Z x0 = x0.contiguous() 2025-05-07T20:33:12.9884654Z x1 = x1.contiguous() 2025-05-07T20:33:12.9884727Z 2025-05-07T20:33:12.9884817Z if scale_ub is not None: 2025-05-07T20:33:12.9884928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9885060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9885135Z ) 2025-05-07T20:33:12.9885216Z else: 2025-05-07T20:33:12.9885307Z scale_ub_tensor = None 2025-05-07T20:33:12.9885381Z 2025-05-07T20:33:12.9885511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9885599Z op = silu_mul_quant 2025-05-07T20:33:12.9885677Z if compiled: 2025-05-07T20:33:12.9885835Z op = torch.compile(op) 2025-05-07T20:33:12.9885936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9886015Z 2025-05-07T20:33:12.9886102Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9886106Z 2025-05-07T20:33:12.9886202Z moe/activation_test.py:117: 2025-05-07T20:33:12.9886333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9886435Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9886531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9887019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9887112Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9887467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9887752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9888089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9888183Z kernel = self.compile( 2025-05-07T20:33:12.9888557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9888727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9888895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9888899Z 2025-05-07T20:33:12.9889099Z self = 2025-05-07T20:33:12.9889859Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9890388Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553fb3e20>} 2025-05-07T20:33:12.9891123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9891308Z context = 2025-05-07T20:33:12.9891315Z 2025-05-07T20:33:12.9891477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9891736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9891839Z module_map=module_map) 2025-05-07T20:33:12.9892002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9892098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9892172Z E ^ 2025-05-07T20:33:12.9892528Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9892532Z 2025-05-07T20:33:12.9892935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9892939Z 2025-05-07T20:33:12.9893093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9893319Z self=, 2025-05-07T20:33:12.9893393Z T=128, 2025-05-07T20:33:12.9893474Z D=5120, 2025-05-07T20:33:12.9893555Z scale_ub=None, 2025-05-07T20:33:12.9893638Z contiguous=False, 2025-05-07T20:33:12.9893725Z compiled=False, 2025-05-07T20:33:12.9893795Z ) 2025-05-07T20:33:12.9894008Z self = 2025-05-07T20:33:12.9894174Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:12.9894179Z 2025-05-07T20:33:12.9894308Z @given( 2025-05-07T20:33:12.9894426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9894529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9894639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9894756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9894868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9894941Z ) 2025-05-07T20:33:12.9895187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9895277Z def test_silu_mul_quant( 2025-05-07T20:33:12.9895352Z self, 2025-05-07T20:33:12.9895426Z T: int, 2025-05-07T20:33:12.9895499Z D: int, 2025-05-07T20:33:12.9895597Z scale_ub: Optional[float], 2025-05-07T20:33:12.9895691Z contiguous: bool, 2025-05-07T20:33:12.9895775Z compiled: bool, 2025-05-07T20:33:12.9895850Z ) -> None: 2025-05-07T20:33:12.9895990Z torch.manual_seed(2025) 2025-05-07T20:33:12.9896063Z 2025-05-07T20:33:12.9896233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9896301Z 2025-05-07T20:33:12.9896390Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9896518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9896605Z x = x_sign * x_clamp 2025-05-07T20:33:12.9896682Z x0 = x[:, :D] 2025-05-07T20:33:12.9896807Z x1 = x[:, D:] 2025-05-07T20:33:12.9896874Z 2025-05-07T20:33:12.9896955Z if contiguous: 2025-05-07T20:33:12.9897052Z x0 = x0.contiguous() 2025-05-07T20:33:12.9897140Z x1 = x1.contiguous() 2025-05-07T20:33:12.9897210Z 2025-05-07T20:33:12.9897302Z if scale_ub is not None: 2025-05-07T20:33:12.9897404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9897537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9897608Z ) 2025-05-07T20:33:12.9897723Z else: 2025-05-07T20:33:12.9897821Z scale_ub_tensor = None 2025-05-07T20:33:12.9897892Z 2025-05-07T20:33:12.9898018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9898111Z op = silu_mul_quant 2025-05-07T20:33:12.9898191Z if compiled: 2025-05-07T20:33:12.9898285Z op = torch.compile(op) 2025-05-07T20:33:12.9898387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9898455Z 2025-05-07T20:33:12.9898541Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9898546Z 2025-05-07T20:33:12.9898642Z moe/activation_test.py:117: 2025-05-07T20:33:12.9898766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9898871Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9898966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9899462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9899561Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9899916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9900150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9900519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9900613Z kernel = self.compile( 2025-05-07T20:33:12.9900990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9901164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9901285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9901290Z 2025-05-07T20:33:12.9901495Z self = 2025-05-07T20:33:12.9902299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9902791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385c400>} 2025-05-07T20:33:12.9903521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9903710Z context = 2025-05-07T20:33:12.9903714Z 2025-05-07T20:33:12.9903876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9904171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9904285Z module_map=module_map) 2025-05-07T20:33:12.9904444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9904539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9904618Z E ^ 2025-05-07T20:33:12.9904962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9905004Z 2025-05-07T20:33:12.9905413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9905417Z 2025-05-07T20:33:12.9905516Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9905734Z self=, 2025-05-07T20:33:12.9905813Z T=128, 2025-05-07T20:33:12.9905889Z D=5120, 2025-05-07T20:33:12.9905969Z scale_ub=1200.0, 2025-05-07T20:33:12.9906096Z contiguous=True, 2025-05-07T20:33:12.9906180Z compiled=False, 2025-05-07T20:33:12.9906255Z ) 2025-05-07T20:33:12.9906469Z self = 2025-05-07T20:33:12.9906636Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.9906641Z 2025-05-07T20:33:12.9906722Z @given( 2025-05-07T20:33:12.9906837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9906935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9907049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9907163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9907272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9907346Z ) 2025-05-07T20:33:12.9907585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9907678Z def test_silu_mul_quant( 2025-05-07T20:33:12.9907753Z self, 2025-05-07T20:33:12.9907827Z T: int, 2025-05-07T20:33:12.9907907Z D: int, 2025-05-07T20:33:12.9908000Z scale_ub: Optional[float], 2025-05-07T20:33:12.9908085Z contiguous: bool, 2025-05-07T20:33:12.9908172Z compiled: bool, 2025-05-07T20:33:12.9908247Z ) -> None: 2025-05-07T20:33:12.9908339Z torch.manual_seed(2025) 2025-05-07T20:33:12.9908413Z 2025-05-07T20:33:12.9908580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9908651Z 2025-05-07T20:33:12.9908745Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9908866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9908954Z x = x_sign * x_clamp 2025-05-07T20:33:12.9909032Z x0 = x[:, :D] 2025-05-07T20:33:12.9909109Z x1 = x[:, D:] 2025-05-07T20:33:12.9909185Z 2025-05-07T20:33:12.9909264Z if contiguous: 2025-05-07T20:33:12.9909349Z x0 = x0.contiguous() 2025-05-07T20:33:12.9909441Z x1 = x1.contiguous() 2025-05-07T20:33:12.9909559Z 2025-05-07T20:33:12.9909647Z if scale_ub is not None: 2025-05-07T20:33:12.9909755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9909884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9909955Z ) 2025-05-07T20:33:12.9910030Z else: 2025-05-07T20:33:12.9910126Z scale_ub_tensor = None 2025-05-07T20:33:12.9910203Z 2025-05-07T20:33:12.9910356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9910461Z op = silu_mul_quant 2025-05-07T20:33:12.9910554Z if compiled: 2025-05-07T20:33:12.9910650Z op = torch.compile(op) 2025-05-07T20:33:12.9910752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9910827Z 2025-05-07T20:33:12.9910915Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9910920Z 2025-05-07T20:33:12.9911014Z moe/activation_test.py:117: 2025-05-07T20:33:12.9911184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9911287Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9911383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9911872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9911967Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9912361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9912577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9912906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9913000Z     kernel = self.compile(
2025-05-07T20:33:12.9913376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9913588Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9913714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9913719Z 
2025-05-07T20:33:12.9917362Z self = 
2025-05-07T20:33:12.9918128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9918629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385d300>}
2025-05-07T20:33:12.9919364Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9919555Z context = 
2025-05-07T20:33:12.9919564Z 
2025-05-07T20:33:12.9919726Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9919999Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9920126Z                            module_map=module_map)
2025-05-07T20:33:12.9920304Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9920407Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9920487Z E       ^
2025-05-07T20:33:12.9920833Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9920838Z 
2025-05-07T20:33:12.9921245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9921340Z 
2025-05-07T20:33:12.9921441Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9921659Z     self=,
2025-05-07T20:33:12.9921736Z     T=1,
2025-05-07T20:33:12.9921810Z     D=7168,
2025-05-07T20:33:12.9921892Z     scale_ub=1200.0,
2025-05-07T20:33:12.9921979Z     contiguous=True,
2025-05-07T20:33:12.9922062Z     compiled=True,
2025-05-07T20:33:12.9922136Z )
2025-05-07T20:33:12.9922352Z self = 
2025-05-07T20:33:12.9922511Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:12.9922516Z 
2025-05-07T20:33:12.9922595Z     @given(
2025-05-07T20:33:12.9922709Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:12.9922806Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:12.9922919Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:12.9923074Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:12.9923188Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:12.9923265Z     )
2025-05-07T20:33:12.9923504Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:12.9923594Z     def test_silu_mul_quant(
2025-05-07T20:33:12.9923670Z         self,
2025-05-07T20:33:12.9923743Z         T: int,
2025-05-07T20:33:12.9923817Z         D: int,
2025-05-07T20:33:12.9923954Z         scale_ub: Optional[float],
2025-05-07T20:33:12.9924041Z         contiguous: bool,
2025-05-07T20:33:12.9924125Z         compiled: bool,
2025-05-07T20:33:12.9924203Z     ) -> None:
2025-05-07T20:33:12.9924295Z         torch.manual_seed(2025)
2025-05-07T20:33:12.9924370Z 
2025-05-07T20:33:12.9924534Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:12.9924605Z 
2025-05-07T20:33:12.9924696Z         x_sign = torch.sign(x)
2025-05-07T20:33:12.9924821Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:12.9924990Z         x = x_sign * x_clamp
2025-05-07T20:33:12.9925075Z         x0 = x[:, :D]
2025-05-07T20:33:12.9925152Z         x1 = x[:, D:]
2025-05-07T20:33:12.9925224Z 
2025-05-07T20:33:12.9925306Z         if contiguous:
2025-05-07T20:33:12.9925392Z             x0 = x0.contiguous()
2025-05-07T20:33:12.9925479Z             x1 = x1.contiguous()
2025-05-07T20:33:12.9925547Z 
2025-05-07T20:33:12.9925638Z         if scale_ub is not None:
2025-05-07T20:33:12.9925742Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:12.9925872Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:12.9925944Z             )
2025-05-07T20:33:12.9926018Z         else:
2025-05-07T20:33:12.9926107Z             scale_ub_tensor = None
2025-05-07T20:33:12.9926177Z 
2025-05-07T20:33:12.9926310Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:12.9926397Z             op = silu_mul_quant
2025-05-07T20:33:12.9926477Z             if compiled:
2025-05-07T20:33:12.9926582Z                 op = torch.compile(op)
2025-05-07T20:33:12.9926684Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9926753Z 
2025-05-07T20:33:12.9926840Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:12.9926844Z 
2025-05-07T20:33:12.9926937Z moe/activation_test.py:117: 
2025-05-07T20:33:12.9927065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9927164Z moe/activation_test.py:115: in fn
2025-05-07T20:33:12.9927257Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9927621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:12.9927714Z     return fn(*args, **kwargs)
2025-05-07T20:33:12.9928195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:12.9928286Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9928638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9928911Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9929241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9929330Z     kernel = self.compile(
2025-05-07T20:33:12.9929713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9929885Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9930009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9930014Z 
2025-05-07T20:33:12.9930217Z self = 
2025-05-07T20:33:12.9931015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9931518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385eac0>}
2025-05-07T20:33:12.9932245Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9932472Z context = 
2025-05-07T20:33:12.9932476Z 
2025-05-07T20:33:12.9932637Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9932895Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9933068Z                            module_map=module_map)
2025-05-07T20:33:12.9933271Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9933369Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9933446Z E       ^
2025-05-07T20:33:12.9933792Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9933797Z 
2025-05-07T20:33:12.9934200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9934207Z 
2025-05-07T20:33:12.9934308Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same source listing and CompilationError traceback as above: type fp8e4nv not supported in this architecture ...]
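A note on the root cause before the remaining retries: Triton's fp8e4nv is the dtype behind torch.float8_e4m3fn, and on this Triton build compiling any kernel that touches it requires compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports (8, 6), so the error is raised while building the kernel IR in make_ir, before anything launches; toggling compiled only changes whether torch.compile drives the call, not the outcome. A minimal capability gate for tests like this, as a sketch (sm89_or_newer and skip_unless_fp8 are illustrative names, not part of the test suite):

    import unittest
    import torch

    def sm89_or_newer() -> bool:
        # fp8e4nv / torch.float8_e4m3fn kernels need SM >= 8.9; the A10G in a
        # g5.4xlarge reports (8, 6), which is exactly why make_ir raises above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(
        sm89_or_newer(), "fp8e4nv requires SM 8.9+ (e.g. L4, L40S, H100)"
    )

Decorating test_silu_mul_quant with skip_unless_fp8 would turn the wall of retries below into a single skip on pre-Ada runners.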
2025-05-07T20:33:12.9947042Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9947258Z     self=,
2025-05-07T20:33:12.9947335Z     T=1,
2025-05-07T20:33:12.9947412Z     D=7168,
2025-05-07T20:33:12.9947490Z     scale_ub=None,
2025-05-07T20:33:12.9947575Z     contiguous=False,
2025-05-07T20:33:12.9947699Z     compiled=True,
2025-05-07T20:33:12.9947770Z )
2025-05-07T20:33:12.9947985Z self = 
2025-05-07T20:33:12.9948147Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test source as above; with scale_ub=None this example gets past fn() and fails in the reference path instead ...]
2025-05-07T20:33:12.9952459Z         y_fp8, y_scale = fn()
2025-05-07T20:33:12.9952579Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:12.9952648Z 
2025-05-07T20:33:12.9952785Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:12.9952882Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:12.9952979Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:12.9953101Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:12.9953236Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:12.9953308Z 
2025-05-07T20:33:12.9953413Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:12.9953418Z 
2025-05-07T20:33:12.9953509Z moe/activation_test.py:126: 
2025-05-07T20:33:12.9953683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9953787Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:12.9953916Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:12.9954465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:12.9954602Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:12.9954950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9955167Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9955523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:12.9955777Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:12.9956834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:12.9957002Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:12.9957339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:12.9957415Z     fn()
2025-05-07T20:33:12.9957810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:12.9957891Z     self.fn.run(
2025-05-07T20:33:12.9958221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9958314Z     kernel = self.compile(
2025-05-07T20:33:12.9958682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9958855Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9958986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9958990Z 
2025-05-07T20:33:12.9959368Z self = 
2025-05-07T20:33:12.9960239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9960739Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552298b80>}
2025-05-07T20:33:12.9961469Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9961667Z context = 
2025-05-07T20:33:12.9961759Z 
2025-05-07T20:33:12.9961942Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9962249Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9962360Z                            module_map=module_map)
2025-05-07T20:33:12.9962540Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9962654Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:12.9962729Z E       ^
2025-05-07T20:33:12.9963152Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9963156Z 
2025-05-07T20:33:12.9963647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9963651Z 
2025-05-07T20:33:12.9963757Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... same source listing and CompilationError traceback in _fbgemm_silu_mul_quant ...]
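The scale_ub=None example above exposes a second casualty: the reference path itself calls triton_quantize_fp8_row, which launches another Triton kernel (_kernel_quantize_fp8_row), so even the expected values cannot be computed on this GPU. A Triton-free reference would still run here, since casting to torch.float8_e4m3fn is a software conversion in PyTorch. A sketch under assumed semantics (per-row scale = row max / fp8 max, optionally capped by scale_ub; the real _kernel_quantize_fp8_row may differ in eps and clamping details, and quantize_fp8_row_ref is an illustrative name):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max -> per-row scale, mirroring a row-wise quantizer.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)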
2025-05-07T20:33:12.9977571Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:12.9989886Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:13.0002572Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:13.0015459Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
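Every retry, eager or compiled, small T or large, dies at the same IR-construction check, so the parameter sweep adds no new information. The failure also reproduces without FBGEMM or Hypothesis at all; a hypothetical one-kernel repro on a pre-SM-8.9 device with this Triton build (the kernel and names below are illustrative, not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 compilation stops here with the same ValueError:
        # "type fp8e4nv not supported in this architecture."
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)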
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0022211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0022426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0022759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0022856Z kernel = self.compile( 2025-05-07T20:33:13.0023226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0023402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0023523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0023528Z 2025-05-07T20:33:13.0023731Z self = 2025-05-07T20:33:13.0024488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0024981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ff1e40>} 2025-05-07T20:33:13.0025759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0025946Z context = 2025-05-07T20:33:13.0025950Z 2025-05-07T20:33:13.0026113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0026368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0026476Z module_map=module_map) 2025-05-07T20:33:13.0026633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0026729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0026808Z E ^ 2025-05-07T20:33:13.0027153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0027200Z 2025-05-07T20:33:13.0027607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0027616Z 2025-05-07T20:33:13.0027714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0027931Z self=, 2025-05-07T20:33:13.0028005Z T=4096, 2025-05-07T20:33:13.0028117Z D=7168, 2025-05-07T20:33:13.0028197Z scale_ub=1200.0, 2025-05-07T20:33:13.0028283Z contiguous=False, 2025-05-07T20:33:13.0028361Z compiled=False, 2025-05-07T20:33:13.0028431Z ) 2025-05-07T20:33:13.0028645Z self = 2025-05-07T20:33:13.0028814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:13.0028818Z 2025-05-07T20:33:13.0028892Z @given( 2025-05-07T20:33:13.0029008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0029145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0029260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0029373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0029485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0029558Z ) 2025-05-07T20:33:13.0029796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0029887Z def test_silu_mul_quant( 2025-05-07T20:33:13.0029965Z self, 2025-05-07T20:33:13.0030038Z T: int, 2025-05-07T20:33:13.0030112Z D: int, 2025-05-07T20:33:13.0030213Z scale_ub: Optional[float], 2025-05-07T20:33:13.0030300Z contiguous: bool, 2025-05-07T20:33:13.0030384Z compiled: bool, 2025-05-07T20:33:13.0030459Z ) -> None: 2025-05-07T20:33:13.0030550Z torch.manual_seed(2025) 2025-05-07T20:33:13.0030622Z 2025-05-07T20:33:13.0030788Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0030866Z 2025-05-07T20:33:13.0030961Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0031084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0031168Z x = x_sign * x_clamp 2025-05-07T20:33:13.0031246Z x0 = x[:, :D] 2025-05-07T20:33:13.0031322Z x1 = x[:, D:] 2025-05-07T20:33:13.0031392Z 2025-05-07T20:33:13.0031473Z if contiguous: 2025-05-07T20:33:13.0031561Z x0 = x0.contiguous() 2025-05-07T20:33:13.0031650Z x1 = x1.contiguous() 2025-05-07T20:33:13.0031722Z 2025-05-07T20:33:13.0031808Z if scale_ub is not None: 2025-05-07T20:33:13.0031913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0032044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0032115Z ) 2025-05-07T20:33:13.0032188Z else: 2025-05-07T20:33:13.0032279Z scale_ub_tensor = None 2025-05-07T20:33:13.0032352Z 2025-05-07T20:33:13.0032531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0032619Z op = silu_mul_quant 2025-05-07T20:33:13.0032698Z if compiled: 2025-05-07T20:33:13.0032798Z op = torch.compile(op) 2025-05-07T20:33:13.0032897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0032967Z 2025-05-07T20:33:13.0033055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0033062Z 2025-05-07T20:33:13.0033156Z moe/activation_test.py:117: 2025-05-07T20:33:13.0033283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0033378Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0033471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0033961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:13.0034054Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0034446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0034675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0035007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0035098Z kernel = self.compile( 2025-05-07T20:33:13.0035471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0038843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0038974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0038980Z 2025-05-07T20:33:13.0039185Z self = 2025-05-07T20:33:13.0040012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0040560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ff3380>} 2025-05-07T20:33:13.0041294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0041484Z context = 2025-05-07T20:33:13.0041488Z 2025-05-07T20:33:13.0041651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0041905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0042013Z module_map=module_map) 2025-05-07T20:33:13.0042178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0042275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0042353Z E ^ 2025-05-07T20:33:13.0042700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0042705Z 2025-05-07T20:33:13.0043108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0043115Z 2025-05-07T20:33:13.0043222Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0043436Z self=, 2025-05-07T20:33:13.0043515Z T=16384, 2025-05-07T20:33:13.0043589Z D=7168, 2025-05-07T20:33:13.0043669Z scale_ub=None, 2025-05-07T20:33:13.0043754Z contiguous=True, 2025-05-07T20:33:13.0043834Z compiled=True, 2025-05-07T20:33:13.0043901Z ) 2025-05-07T20:33:13.0044120Z self = 2025-05-07T20:33:13.0044355Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:13.0044360Z 2025-05-07T20:33:13.0044434Z @given( 2025-05-07T20:33:13.0044551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0044645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0044755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0044875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0044983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0045065Z ) 2025-05-07T20:33:13.0045305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0045393Z def test_silu_mul_quant( 2025-05-07T20:33:13.0045470Z self, 2025-05-07T20:33:13.0045545Z T: int, 2025-05-07T20:33:13.0045616Z D: int, 2025-05-07T20:33:13.0045755Z scale_ub: Optional[float], 2025-05-07T20:33:13.0045848Z contiguous: bool, 2025-05-07T20:33:13.0045928Z compiled: bool, 2025-05-07T20:33:13.0046012Z ) -> None: 2025-05-07T20:33:13.0046103Z torch.manual_seed(2025) 2025-05-07T20:33:13.0046171Z 2025-05-07T20:33:13.0046337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0046408Z 2025-05-07T20:33:13.0046498Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0046662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0046745Z x = x_sign * x_clamp 2025-05-07T20:33:13.0046825Z x0 = x[:, :D] 2025-05-07T20:33:13.0046901Z x1 = x[:, D:] 2025-05-07T20:33:13.0046972Z 2025-05-07T20:33:13.0047052Z if contiguous: 2025-05-07T20:33:13.0047138Z x0 = x0.contiguous() 2025-05-07T20:33:13.0047226Z x1 = x1.contiguous() 2025-05-07T20:33:13.0047298Z 2025-05-07T20:33:13.0047383Z if scale_ub is not None: 2025-05-07T20:33:13.0047530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0047666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0047739Z ) 2025-05-07T20:33:13.0047816Z else: 2025-05-07T20:33:13.0047907Z scale_ub_tensor = None 2025-05-07T20:33:13.0047978Z 2025-05-07T20:33:13.0048105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0048193Z op = silu_mul_quant 2025-05-07T20:33:13.0048272Z if compiled: 2025-05-07T20:33:13.0048371Z op = torch.compile(op) 2025-05-07T20:33:13.0048472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0048541Z 2025-05-07T20:33:13.0048629Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0048634Z 2025-05-07T20:33:13.0048724Z moe/activation_test.py:117: 2025-05-07T20:33:13.0048850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0048947Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0049052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0049414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0049503Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ..., T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
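The failure is environment-specific rather than a bug in the kernel or the test: Triton's fp8e4nv is the NVIDIA float8 e4m3 format, which the CUDA backend accepts only on GPUs of compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly the list in the ValueError, so the GPU on this runner is evidently below 8.9. A capability gate would let the suite skip these examples instead of failing the whole job. The following is a minimal sketch, not FBGEMM's actual gating; supports_fp8e4nv and the test class name are hypothetical, and the (8, 9) floor reflects NVIDIA's e4m3 support:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Best-effort check: fp8e4nv (float8 e4m3) kernels compile only on
        # NVIDIA GPUs with compute capability >= 8.9; older parts raise the
        # ValueError above at Triton compile time.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8ActivationTest(unittest.TestCase):
        def test_silu_mul_quant_smoke(self) -> None:
            # Placeholder body; the real coverage is the Hypothesis-driven
            # test shown above, which this gate would protect.
            self.assertTrue(supports_fp8e4nv())

With such a gate the job would report these cases as skipped on unsupported hardware rather than burning through every Hypothesis example, as happens below.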
Hypothesis went on to retry the test with eleven further examples. Each one re-ran the test body shown above and failed at the same call site (activation.py:80 -> triton jit.py:330/623 -> compiler.py:273 -> compiler.py:100) with the identical CompilationError; only the sampled parameters differ:

Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True

In each case:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0210867Z 2025-05-07T20:33:13.0211270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0211277Z 2025-05-07T20:33:13.0211377Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0211596Z self=, 2025-05-07T20:33:13.0211667Z T=2048, 2025-05-07T20:33:13.0211739Z D=5120, 2025-05-07T20:33:13.0211820Z scale_ub=None, 2025-05-07T20:33:13.0211908Z contiguous=False, 2025-05-07T20:33:13.0211990Z compiled=True, 2025-05-07T20:33:13.0212059Z ) 2025-05-07T20:33:13.0212276Z self = 2025-05-07T20:33:13.0212495Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:13.0212499Z 2025-05-07T20:33:13.0212576Z @given( 2025-05-07T20:33:13.0212690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0212789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0212899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0213092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0213204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0213272Z ) 2025-05-07T20:33:13.0213515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0213604Z def test_silu_mul_quant( 2025-05-07T20:33:13.0213676Z self, 2025-05-07T20:33:13.0213752Z T: int, 2025-05-07T20:33:13.0213823Z D: int, 2025-05-07T20:33:13.0213918Z scale_ub: Optional[float], 2025-05-07T20:33:13.0214049Z contiguous: bool, 2025-05-07T20:33:13.0214139Z compiled: bool, 2025-05-07T20:33:13.0214214Z ) -> None: 2025-05-07T20:33:13.0214307Z torch.manual_seed(2025) 2025-05-07T20:33:13.0214378Z 2025-05-07T20:33:13.0214542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0214613Z 2025-05-07T20:33:13.0214701Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0214869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0214956Z x = x_sign * x_clamp 2025-05-07T20:33:13.0215031Z x0 = x[:, :D] 2025-05-07T20:33:13.0215112Z x1 = x[:, D:] 2025-05-07T20:33:13.0215179Z 2025-05-07T20:33:13.0215259Z if contiguous: 2025-05-07T20:33:13.0215348Z x0 = x0.contiguous() 2025-05-07T20:33:13.0215434Z x1 = x1.contiguous() 2025-05-07T20:33:13.0215503Z 2025-05-07T20:33:13.0215590Z if scale_ub is not None: 2025-05-07T20:33:13.0215699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0215873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0215948Z ) 2025-05-07T20:33:13.0216019Z else: 2025-05-07T20:33:13.0216111Z scale_ub_tensor = None 2025-05-07T20:33:13.0216186Z 2025-05-07T20:33:13.0216312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0216398Z op = silu_mul_quant 2025-05-07T20:33:13.0216486Z if compiled: 2025-05-07T20:33:13.0216582Z op = torch.compile(op) 2025-05-07T20:33:13.0216685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0216751Z 2025-05-07T20:33:13.0216836Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0216841Z 2025-05-07T20:33:13.0216935Z moe/activation_test.py:117: 2025-05-07T20:33:13.0217058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0217153Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0217259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0217620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0217707Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0218190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0218282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0218633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0218848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0219178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0219272Z kernel = self.compile( 2025-05-07T20:33:13.0219651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0219867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0219989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0219993Z 2025-05-07T20:33:13.0220191Z self = 2025-05-07T20:33:13.0220953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0221447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1c019e0>} 2025-05-07T20:33:13.0222226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0222413Z context = 2025-05-07T20:33:13.0222418Z 2025-05-07T20:33:13.0222577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0222835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0223001Z module_map=module_map) 2025-05-07T20:33:13.0223161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0223255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0223328Z E ^ 2025-05-07T20:33:13.0223677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0223682Z 2025-05-07T20:33:13.0224084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0224134Z 2025-05-07T20:33:13.0224236Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0224453Z self=, 2025-05-07T20:33:13.0224526Z T=2048, 2025-05-07T20:33:13.0224602Z D=5120, 2025-05-07T20:33:13.0224681Z scale_ub=1200.0, 2025-05-07T20:33:13.0224763Z contiguous=False, 2025-05-07T20:33:13.0224844Z compiled=True, 2025-05-07T20:33:13.0224915Z ) 2025-05-07T20:33:13.0225127Z self = 2025-05-07T20:33:13.0225297Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0225302Z 2025-05-07T20:33:13.0225377Z @given( 2025-05-07T20:33:13.0225499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0225592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0225704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0225823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0225934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0226005Z ) 2025-05-07T20:33:13.0226245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0226331Z def test_silu_mul_quant( 2025-05-07T20:33:13.0226403Z self, 2025-05-07T20:33:13.0226478Z T: int, 2025-05-07T20:33:13.0226556Z D: int, 2025-05-07T20:33:13.0226653Z scale_ub: Optional[float], 2025-05-07T20:33:13.0226740Z contiguous: bool, 2025-05-07T20:33:13.0226822Z compiled: bool, 2025-05-07T20:33:13.0226899Z ) -> None: 2025-05-07T20:33:13.0226989Z torch.manual_seed(2025) 2025-05-07T20:33:13.0227060Z 2025-05-07T20:33:13.0227225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0227297Z 2025-05-07T20:33:13.0227382Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0227512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0227643Z x = x_sign * x_clamp 2025-05-07T20:33:13.0227718Z x0 = x[:, :D] 2025-05-07T20:33:13.0227801Z x1 = x[:, D:] 2025-05-07T20:33:13.0227868Z 2025-05-07T20:33:13.0227947Z if contiguous: 2025-05-07T20:33:13.0228035Z x0 = x0.contiguous() 2025-05-07T20:33:13.0228119Z x1 = x1.contiguous() 2025-05-07T20:33:13.0228190Z 2025-05-07T20:33:13.0228278Z if scale_ub is not None: 2025-05-07T20:33:13.0228380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0228512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0228582Z ) 2025-05-07T20:33:13.0228654Z else: 2025-05-07T20:33:13.0228749Z scale_ub_tensor = None 2025-05-07T20:33:13.0228818Z 2025-05-07T20:33:13.0228941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0229028Z op = silu_mul_quant 2025-05-07T20:33:13.0229151Z if compiled: 2025-05-07T20:33:13.0229252Z op = torch.compile(op) 2025-05-07T20:33:13.0229358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0229426Z 2025-05-07T20:33:13.0229514Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0229519Z 2025-05-07T20:33:13.0229611Z moe/activation_test.py:117: 2025-05-07T20:33:13.0229734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0229872Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0229971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0230375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0230468Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0230949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0231047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0231443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0231661Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0231998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0232087Z kernel = self.compile( 2025-05-07T20:33:13.0232464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0232637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0232759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0232763Z 2025-05-07T20:33:13.0232965Z self = 2025-05-07T20:33:13.0233724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0234223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1c02b60>} 2025-05-07T20:33:13.0234950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0235137Z context = 2025-05-07T20:33:13.0235142Z 2025-05-07T20:33:13.0235304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0235555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0235662Z module_map=module_map) 2025-05-07T20:33:13.0235864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0235957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0236035Z E ^ 2025-05-07T20:33:13.0236380Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0236385Z 2025-05-07T20:33:13.0236786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0236796Z 2025-05-07T20:33:13.0236892Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0237108Z self=, 2025-05-07T20:33:13.0237185Z T=4096, 2025-05-07T20:33:13.0237255Z D=5120, 2025-05-07T20:33:13.0237334Z scale_ub=1200.0, 2025-05-07T20:33:13.0237420Z contiguous=True, 2025-05-07T20:33:13.0237500Z compiled=True, 2025-05-07T20:33:13.0237569Z ) 2025-05-07T20:33:13.0237832Z self = 2025-05-07T20:33:13.0237998Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:13.0238002Z 2025-05-07T20:33:13.0238077Z @given( 2025-05-07T20:33:13.0238195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0238291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0238448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0238559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0238670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0238741Z ) 2025-05-07T20:33:13.0238978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0239068Z def test_silu_mul_quant( 2025-05-07T20:33:13.0239146Z self, 2025-05-07T20:33:13.0239217Z T: int, 2025-05-07T20:33:13.0239291Z D: int, 2025-05-07T20:33:13.0239430Z scale_ub: Optional[float], 2025-05-07T20:33:13.0239520Z contiguous: bool, 2025-05-07T20:33:13.0239603Z compiled: bool, 2025-05-07T20:33:13.0239676Z ) -> None: 2025-05-07T20:33:13.0239768Z torch.manual_seed(2025) 2025-05-07T20:33:13.0239840Z 2025-05-07T20:33:13.0240005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0240072Z 2025-05-07T20:33:13.0240166Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0240291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0240377Z x = x_sign * x_clamp 2025-05-07T20:33:13.0240470Z x0 = x[:, :D] 2025-05-07T20:33:13.0240558Z x1 = x[:, D:] 2025-05-07T20:33:13.0240640Z 2025-05-07T20:33:13.0240734Z if contiguous: 2025-05-07T20:33:13.0240822Z x0 = x0.contiguous() 2025-05-07T20:33:13.0240905Z x1 = x1.contiguous() 2025-05-07T20:33:13.0240973Z 2025-05-07T20:33:13.0241061Z if scale_ub is not None: 2025-05-07T20:33:13.0241170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0241299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0241368Z ) 2025-05-07T20:33:13.0241446Z else: 2025-05-07T20:33:13.0241535Z scale_ub_tensor = None 2025-05-07T20:33:13.0241604Z 2025-05-07T20:33:13.0241734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0241824Z op = silu_mul_quant 2025-05-07T20:33:13.0241903Z if compiled: 2025-05-07T20:33:13.0242004Z op = torch.compile(op) 2025-05-07T20:33:13.0242105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0242174Z 2025-05-07T20:33:13.0242263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0242267Z 2025-05-07T20:33:13.0242358Z moe/activation_test.py:117: 2025-05-07T20:33:13.0242487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0242588Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0242733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0243095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0243183Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0243662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0243763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0244110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0244328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0244657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0244744Z kernel = self.compile( 2025-05-07T20:33:13.0245160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0245333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0245459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0245464Z 2025-05-07T20:33:13.0245663Z self = 2025-05-07T20:33:13.0246463Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0246957Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1978180>} 2025-05-07T20:33:13.0247726Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0247916Z context = 2025-05-07T20:33:13.0247921Z 2025-05-07T20:33:13.0248078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0248331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0248440Z module_map=module_map) 2025-05-07T20:33:13.0248595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0248695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0248767Z E ^ 2025-05-07T20:33:13.0249111Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0249115Z 2025-05-07T20:33:13.0249525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0249532Z 2025-05-07T20:33:13.0249632Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0249851Z self=, 2025-05-07T20:33:13.0249926Z T=128, 2025-05-07T20:33:13.0249998Z D=5120, 2025-05-07T20:33:13.0250086Z scale_ub=1200.0, 2025-05-07T20:33:13.0250188Z contiguous=False, 2025-05-07T20:33:13.0250276Z compiled=True, 2025-05-07T20:33:13.0250361Z ) 2025-05-07T20:33:13.0250573Z self = 2025-05-07T20:33:13.0250738Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0250742Z 2025-05-07T20:33:13.0250820Z @given( 2025-05-07T20:33:13.0250934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0251028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0251142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0251303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0251417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0251491Z ) 2025-05-07T20:33:13.0251730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0251827Z def test_silu_mul_quant( 2025-05-07T20:33:13.0251903Z self, 2025-05-07T20:33:13.0251979Z T: int, 2025-05-07T20:33:13.0252056Z D: int, 2025-05-07T20:33:13.0252153Z scale_ub: Optional[float], 2025-05-07T20:33:13.0252237Z contiguous: bool, 2025-05-07T20:33:13.0252320Z compiled: bool, 2025-05-07T20:33:13.0252396Z ) -> None: 2025-05-07T20:33:13.0252489Z torch.manual_seed(2025) 2025-05-07T20:33:13.0252564Z 2025-05-07T20:33:13.0252727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0252798Z 2025-05-07T20:33:13.0252954Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0253139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0253227Z x = x_sign * x_clamp 2025-05-07T20:33:13.0253304Z x0 = x[:, :D] 2025-05-07T20:33:13.0253381Z x1 = x[:, D:] 2025-05-07T20:33:13.0253450Z 2025-05-07T20:33:13.0253532Z if contiguous: 2025-05-07T20:33:13.0253619Z x0 = x0.contiguous() 2025-05-07T20:33:13.0253709Z x1 = x1.contiguous() 2025-05-07T20:33:13.0253824Z 2025-05-07T20:33:13.0253910Z if scale_ub is not None: 2025-05-07T20:33:13.0254014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0254141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0254220Z ) 2025-05-07T20:33:13.0254291Z else: 2025-05-07T20:33:13.0254382Z scale_ub_tensor = None 2025-05-07T20:33:13.0254454Z 2025-05-07T20:33:13.0254580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0254713Z op = silu_mul_quant 2025-05-07T20:33:13.0254799Z if compiled: 2025-05-07T20:33:13.0254896Z op = torch.compile(op) 2025-05-07T20:33:13.0254996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0255067Z 2025-05-07T20:33:13.0255154Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0255158Z 2025-05-07T20:33:13.0255249Z moe/activation_test.py:117: 2025-05-07T20:33:13.0255379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0255475Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0255573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0255933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0256019Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0256503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0256602Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0256948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0257168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0257497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0257593Z kernel = self.compile( 2025-05-07T20:33:13.0257964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0258133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0258257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0258261Z 2025-05-07T20:33:13.0258458Z self = 2025-05-07T20:33:13.0259400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0260060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1978ea0>} 2025-05-07T20:33:13.0260839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0261025Z context = 2025-05-07T20:33:13.0261030Z 2025-05-07T20:33:13.0261191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0261519Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0261629Z module_map=module_map) 2025-05-07T20:33:13.0261785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0261882Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0261956Z E ^ 2025-05-07T20:33:13.0262302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0262367Z 2025-05-07T20:33:13.0262770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0262775Z 2025-05-07T20:33:13.0262874Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0263093Z self=, 2025-05-07T20:33:13.0263164Z T=16384, 2025-05-07T20:33:13.0263240Z D=7168, 2025-05-07T20:33:13.0263319Z scale_ub=1200.0, 2025-05-07T20:33:13.0263398Z contiguous=True, 2025-05-07T20:33:13.0263543Z compiled=True, 2025-05-07T20:33:13.0263617Z ) 2025-05-07T20:33:13.0263829Z self = 2025-05-07T20:33:13.0264000Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:13.0264005Z 2025-05-07T20:33:13.0264077Z @given( 2025-05-07T20:33:13.0264192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0264295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0264404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0264519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0264627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0264701Z ) 2025-05-07T20:33:13.0264946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0265033Z def test_silu_mul_quant( 2025-05-07T20:33:13.0265107Z self, 2025-05-07T20:33:13.0265186Z T: int, 2025-05-07T20:33:13.0265262Z D: int, 2025-05-07T20:33:13.0265358Z scale_ub: Optional[float], 2025-05-07T20:33:13.0265451Z contiguous: bool, 2025-05-07T20:33:13.0265532Z compiled: bool, 2025-05-07T20:33:13.0265605Z ) -> None: 2025-05-07T20:33:13.0265700Z torch.manual_seed(2025) 2025-05-07T20:33:13.0265772Z 2025-05-07T20:33:13.0265935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0266013Z 2025-05-07T20:33:13.0266098Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0266221Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0266304Z x = x_sign * x_clamp 2025-05-07T20:33:13.0266381Z x0 = x[:, :D] 2025-05-07T20:33:13.0266459Z x1 = x[:, D:] 2025-05-07T20:33:13.0266528Z 2025-05-07T20:33:13.0266606Z if contiguous: 2025-05-07T20:33:13.0266693Z x0 = x0.contiguous() 2025-05-07T20:33:13.0266777Z x1 = x1.contiguous() 2025-05-07T20:33:13.0266903Z 2025-05-07T20:33:13.0266993Z if scale_ub is not None: 2025-05-07T20:33:13.0267095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0267224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0267300Z ) 2025-05-07T20:33:13.0267374Z else: 2025-05-07T20:33:13.0267469Z scale_ub_tensor = None 2025-05-07T20:33:13.0267535Z 2025-05-07T20:33:13.0267661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0267752Z op = silu_mul_quant 2025-05-07T20:33:13.0267833Z if compiled: 2025-05-07T20:33:13.0267927Z op = torch.compile(op) 2025-05-07T20:33:13.0268031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0268101Z 2025-05-07T20:33:13.0268187Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0268191Z 2025-05-07T20:33:13.0268287Z moe/activation_test.py:117: 2025-05-07T20:33:13.0268456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0268560Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0268654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0269018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0269108Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0269591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0269724Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0270074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0270290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0270626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0270757Z kernel = self.compile( 2025-05-07T20:33:13.0271133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0271304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0271426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0271431Z 2025-05-07T20:33:13.0271635Z self = 2025-05-07T20:33:13.0272391Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0272883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a197a0c0>} 2025-05-07T20:33:13.0273619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0273806Z context = 2025-05-07T20:33:13.0273810Z 2025-05-07T20:33:13.0273971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0274227Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0274329Z module_map=module_map) 2025-05-07T20:33:13.0274486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0274579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0274650Z E ^ 2025-05-07T20:33:13.0274997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0275001Z 2025-05-07T20:33:13.0275450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0275454Z 2025-05-07T20:33:13.0275555Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0275771Z self=, 2025-05-07T20:33:13.0275842Z T=16384, 2025-05-07T20:33:13.0275915Z D=5120, 2025-05-07T20:33:13.0275999Z scale_ub=1200.0, 2025-05-07T20:33:13.0276078Z contiguous=True, 2025-05-07T20:33:13.0276165Z compiled=False, 2025-05-07T20:33:13.0276233Z ) 2025-05-07T20:33:13.0279580Z self = 2025-05-07T20:33:13.0279775Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:13.0279780Z 2025-05-07T20:33:13.0279859Z @given( 2025-05-07T20:33:13.0279974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0280138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0280279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0280402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0280529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0280601Z ) 2025-05-07T20:33:13.0280842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0280980Z def test_silu_mul_quant( 2025-05-07T20:33:13.0281052Z self, 2025-05-07T20:33:13.0281123Z T: int, 2025-05-07T20:33:13.0281198Z D: int, 2025-05-07T20:33:13.0281294Z scale_ub: Optional[float], 2025-05-07T20:33:13.0281381Z contiguous: bool, 2025-05-07T20:33:13.0281468Z compiled: bool, 2025-05-07T20:33:13.0281545Z ) -> None: 2025-05-07T20:33:13.0281637Z torch.manual_seed(2025) 2025-05-07T20:33:13.0281709Z 2025-05-07T20:33:13.0281876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0281959Z 2025-05-07T20:33:13.0282092Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0282215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0282301Z x = x_sign * x_clamp 2025-05-07T20:33:13.0282377Z x0 = x[:, :D] 2025-05-07T20:33:13.0282450Z x1 = x[:, D:] 2025-05-07T20:33:13.0282524Z 2025-05-07T20:33:13.0282603Z if contiguous: 2025-05-07T20:33:13.0282692Z x0 = x0.contiguous() 2025-05-07T20:33:13.0282780Z x1 = x1.contiguous() 2025-05-07T20:33:13.0282848Z 2025-05-07T20:33:13.0282936Z if scale_ub is not None: 2025-05-07T20:33:13.0283040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0283171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0283242Z ) 2025-05-07T20:33:13.0283316Z else: 2025-05-07T20:33:13.0283406Z scale_ub_tensor = None 2025-05-07T20:33:13.0283478Z 2025-05-07T20:33:13.0283608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0283698Z op = silu_mul_quant 2025-05-07T20:33:13.0283782Z if compiled: 2025-05-07T20:33:13.0283877Z op = torch.compile(op) 2025-05-07T20:33:13.0283976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0284048Z 2025-05-07T20:33:13.0284134Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0284139Z 2025-05-07T20:33:13.0284235Z moe/activation_test.py:117: 2025-05-07T20:33:13.0284363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0284461Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0284559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0285048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:13.0285143Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0285499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0285786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0286121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0286213Z kernel = self.compile( 2025-05-07T20:33:13.0286585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0286760Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0286883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0286887Z 2025-05-07T20:33:13.0287086Z self = 2025-05-07T20:33:13.0287890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0288385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1979a80>} 2025-05-07T20:33:13.0289114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0289361Z context = 2025-05-07T20:33:13.0289366Z 2025-05-07T20:33:13.0289528Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0289781Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0289884Z module_map=module_map) 2025-05-07T20:33:13.0290088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0290187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0290264Z E ^ 2025-05-07T20:33:13.0290657Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0290663Z 2025-05-07T20:33:13.0291068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0291079Z 2025-05-07T20:33:13.0291181Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0291397Z self=, 2025-05-07T20:33:13.0291469Z T=1, 2025-05-07T20:33:13.0291541Z D=7168, 2025-05-07T20:33:13.0291619Z scale_ub=1200.0, 2025-05-07T20:33:13.0291704Z contiguous=False, 2025-05-07T20:33:13.0291784Z compiled=False, 2025-05-07T20:33:13.0291855Z ) 2025-05-07T20:33:13.0292072Z self = 2025-05-07T20:33:13.0292241Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:13.0292245Z 2025-05-07T20:33:13.0292318Z @given( 2025-05-07T20:33:13.0292440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0292536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0292649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0292769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0292879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0292951Z ) 2025-05-07T20:33:13.0293304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0293394Z def test_silu_mul_quant( 2025-05-07T20:33:13.0293468Z self, 2025-05-07T20:33:13.0293539Z T: int, 2025-05-07T20:33:13.0293610Z D: int, 2025-05-07T20:33:13.0293709Z scale_ub: Optional[float], 2025-05-07T20:33:13.0293848Z contiguous: bool, 2025-05-07T20:33:13.0293966Z compiled: bool, 2025-05-07T20:33:13.0294075Z ) -> None: 2025-05-07T20:33:13.0294198Z torch.manual_seed(2025) 2025-05-07T20:33:13.0294290Z 2025-05-07T20:33:13.0294518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0294618Z 2025-05-07T20:33:13.0294718Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0294849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0294932Z x = x_sign * x_clamp 2025-05-07T20:33:13.0295016Z x0 = x[:, :D] 2025-05-07T20:33:13.0295095Z x1 = x[:, D:] 2025-05-07T20:33:13.0295174Z 2025-05-07T20:33:13.0295261Z if contiguous: 2025-05-07T20:33:13.0295378Z x0 = x0.contiguous() 2025-05-07T20:33:13.0295497Z x1 = x1.contiguous() 2025-05-07T20:33:13.0295594Z 2025-05-07T20:33:13.0295681Z if scale_ub is not None: 2025-05-07T20:33:13.0295849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0295989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0296062Z ) 2025-05-07T20:33:13.0296133Z else: 2025-05-07T20:33:13.0296226Z scale_ub_tensor = None 2025-05-07T20:33:13.0296297Z 2025-05-07T20:33:13.0296423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0296510Z op = silu_mul_quant 2025-05-07T20:33:13.0296637Z if compiled: 2025-05-07T20:33:13.0296734Z op = torch.compile(op) 2025-05-07T20:33:13.0296834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0296903Z 2025-05-07T20:33:13.0296993Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0296998Z 2025-05-07T20:33:13.0297089Z moe/activation_test.py:117: 2025-05-07T20:33:13.0297213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0297315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0297453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0297944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0298040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0298391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0298617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0298946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0299036Z kernel = self.compile( 2025-05-07T20:33:13.0299410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0299580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0299712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0299718Z 2025-05-07T20:33:13.0299917Z self = 2025-05-07T20:33:13.0300680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0301177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552a400e0>} 2025-05-07T20:33:13.0301903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0302090Z context = 2025-05-07T20:33:13.0302138Z 2025-05-07T20:33:13.0302305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0302557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0302664Z module_map=module_map) 2025-05-07T20:33:13.0302823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0302921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0302998Z E ^ 2025-05-07T20:33:13.0303341Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0303345Z 2025-05-07T20:33:13.0303750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0303755Z 2025-05-07T20:33:13.0303853Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0304112Z self=, 2025-05-07T20:33:13.0304192Z T=4096, 2025-05-07T20:33:13.0304264Z D=7168, 2025-05-07T20:33:13.0304348Z scale_ub=1200.0, 2025-05-07T20:33:13.0304431Z contiguous=False, 2025-05-07T20:33:13.0304509Z compiled=True, 2025-05-07T20:33:13.0304579Z ) 2025-05-07T20:33:13.0304789Z self = 2025-05-07T20:33:13.0304959Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0305005Z 2025-05-07T20:33:13.0305086Z @given( 2025-05-07T20:33:13.0305201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0305301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0305411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0305523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0305635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0305705Z ) 2025-05-07T20:33:13.0305986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0306082Z def test_silu_mul_quant( 2025-05-07T20:33:13.0306158Z self, 2025-05-07T20:33:13.0306232Z T: int, 2025-05-07T20:33:13.0306310Z D: int, 2025-05-07T20:33:13.0306407Z scale_ub: Optional[float], 2025-05-07T20:33:13.0306492Z contiguous: bool, 2025-05-07T20:33:13.0306576Z compiled: bool, 2025-05-07T20:33:13.0306654Z ) -> None: 2025-05-07T20:33:13.0306747Z torch.manual_seed(2025) 2025-05-07T20:33:13.0306814Z 2025-05-07T20:33:13.0306977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0307051Z 2025-05-07T20:33:13.0307139Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0307262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0307353Z x = x_sign * x_clamp 2025-05-07T20:33:13.0307431Z x0 = x[:, :D] 2025-05-07T20:33:13.0307508Z x1 = x[:, D:] 2025-05-07T20:33:13.0307583Z 2025-05-07T20:33:13.0307663Z if contiguous: 2025-05-07T20:33:13.0307749Z x0 = x0.contiguous() 2025-05-07T20:33:13.0307836Z x1 = x1.contiguous() 2025-05-07T20:33:13.0307906Z 2025-05-07T20:33:13.0307993Z if scale_ub is not None: 2025-05-07T20:33:13.0308099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0308228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0308307Z ) 2025-05-07T20:33:13.0308379Z else: 2025-05-07T20:33:13.0308470Z scale_ub_tensor = None 2025-05-07T20:33:13.0308543Z 2025-05-07T20:33:13.0308666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0308752Z op = silu_mul_quant 2025-05-07T20:33:13.0308837Z if compiled: 2025-05-07T20:33:13.0308934Z op = torch.compile(op) 2025-05-07T20:33:13.0309032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0309103Z 2025-05-07T20:33:13.0309315Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0309320Z 2025-05-07T20:33:13.0309415Z moe/activation_test.py:117: 2025-05-07T20:33:13.0309538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0309634Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0309732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0310094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0310193Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0310719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0310812Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0311166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0311425Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0311762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0311856Z kernel = self.compile( 2025-05-07T20:33:13.0312225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0312393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0312564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0312569Z 2025-05-07T20:33:13.0312766Z self = 2025-05-07T20:33:13.0313523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0314081Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552a41300>} 2025-05-07T20:33:13.0314812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0314999Z context = 2025-05-07T20:33:13.0315004Z 2025-05-07T20:33:13.0315164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0315421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0315523Z module_map=module_map) 2025-05-07T20:33:13.0315683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0315777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0315858Z E ^ 2025-05-07T20:33:13.0316206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0316211Z 2025-05-07T20:33:13.0316612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0316616Z 2025-05-07T20:33:13.0316713Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0316936Z self=, 2025-05-07T20:33:13.0317008Z T=128, 2025-05-07T20:33:13.0317083Z D=7168, 2025-05-07T20:33:13.0317161Z scale_ub=1200.0, 2025-05-07T20:33:13.0317242Z contiguous=False, 2025-05-07T20:33:13.0317325Z compiled=True, 2025-05-07T20:33:13.0317393Z ) 2025-05-07T20:33:13.0317604Z self = 2025-05-07T20:33:13.0317774Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0317825Z 2025-05-07T20:33:13.0317899Z @given( 2025-05-07T20:33:13.0318015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0318111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0318222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0318338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0318446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0318519Z ) 2025-05-07T20:33:13.0318761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0318850Z def test_silu_mul_quant( 2025-05-07T20:33:13.0318923Z self, 2025-05-07T20:33:13.0318997Z T: int, 2025-05-07T20:33:13.0319069Z D: int, 2025-05-07T20:33:13.0319164Z scale_ub: Optional[float], 2025-05-07T20:33:13.0319253Z contiguous: bool, 2025-05-07T20:33:13.0319334Z compiled: bool, 2025-05-07T20:33:13.0319448Z ) -> None: 2025-05-07T20:33:13.0319549Z torch.manual_seed(2025) 2025-05-07T20:33:13.0319619Z 2025-05-07T20:33:13.0319786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0319853Z 2025-05-07T20:33:13.0319943Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0320074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0320160Z x = x_sign * x_clamp 2025-05-07T20:33:13.0320295Z x0 = x[:, :D] 2025-05-07T20:33:13.0320384Z x1 = x[:, D:] 2025-05-07T20:33:13.0320466Z 2025-05-07T20:33:13.0320558Z if contiguous: 2025-05-07T20:33:13.0320647Z x0 = x0.contiguous() 2025-05-07T20:33:13.0320733Z x1 = x1.contiguous() 2025-05-07T20:33:13.0320799Z 2025-05-07T20:33:13.0320888Z if scale_ub is not None: 2025-05-07T20:33:13.0320990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0321121Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0321234Z ) 2025-05-07T20:33:13.0321309Z else: 2025-05-07T20:33:13.0321404Z scale_ub_tensor = None 2025-05-07T20:33:13.0321474Z 2025-05-07T20:33:13.0321598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0321687Z op = silu_mul_quant 2025-05-07T20:33:13.0321768Z if compiled: 2025-05-07T20:33:13.0321863Z op = torch.compile(op) 2025-05-07T20:33:13.0321969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0322039Z 2025-05-07T20:33:13.0322125Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0322129Z 2025-05-07T20:33:13.0322228Z moe/activation_test.py:117: 2025-05-07T20:33:13.0322350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0322451Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0322546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0322910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0323003Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0323485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f0552a42020>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining Triton frames and the CompilationError are identical to the traceback above)
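The CompilationError above is Triton's NVIDIA backend rejecting the fp8e4nv (float8_e4m3fn) element type on this GPU. The 22 GiB device reported in the surrounding OOM messages is consistent with an NVIDIA A10G (compute capability 8.6), and Triton accepts fp8e4nv only on newer parts (roughly SM 8.9/Ada and up, an assumption inferred from this error), which is why it offers only 'fp8e4b15' and 'fp8e5' here. A minimal sketch of a hardware gate for such tests, assuming only public torch/pytest APIs; the helper and test names are illustrative, not from FBGEMM:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv kernels are rejected below SM 8.9 (assumption inferred from
        # the error above); this runner's GPU would report (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant_fp8_guarded() -> None:
        ...  # same body as test_silu_mul_quant above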
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The examples below re-run the same test body; only the failing statement and error are kept for each.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 112.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB; 140.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 56.00 MiB; 28.44 MiB free)
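As a sanity check on the numbers (not part of the log), the allocation sizes match the bfloat16 input tensor the test builds at moe/activation_test.py:92: for T=16384 and D=7168 the [T, 2*D] tensor is exactly the 448.00 MiB reported, and T=16384 with D=5120 gives the 320.00 MiB figure:

    # Illustrative arithmetic only; T and D are copied from the failing examples.
    for T, D in [(16384, 7168), (16384, 5120)]:
        elements = T * (2 * D)             # shape [T, 2*D]
        bytes_needed = elements * 2        # 2 bytes per bfloat16 element
        print(T, D, bytes_needed / 2**20)  # -> 448.0 MiB and 320.0 MiB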
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 56.00 MiB; 28.44 MiB free)

The small-T examples still fit in the remaining memory and instead reach the Triton kernel, where each one fails with the same fp8e4nv CompilationError as above:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: CompilationError via triton/compiler/compiler.py:100 -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (same error)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (same error)
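Hypothesis keeps drawing new parameter combinations after each failure. Once a failing combination is known from the log, it can be pinned for deterministic replay; a minimal sketch, assuming only the public hypothesis API (strategy values copied from the @given block above, test body elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # replay the failing input recorded above
    @settings(deadline=None, max_examples=10)
    def test_silu_mul_quant_replay(T: int, D: int) -> None:
        ...  # same body as test_silu_mul_quant above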
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 56.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: CompilationError (same fp8e4nv error)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 320.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 80.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:13.0592642Z 2025-05-07T20:33:13.0592762Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:13.0592967Z 2025-05-07T20:33:13.0593063Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0593460Z self=, 2025-05-07T20:33:13.0593846Z T=16384, 2025-05-07T20:33:13.0594036Z D=7168, 2025-05-07T20:33:13.0594216Z scale_ub=1200.0, 2025-05-07T20:33:13.0594436Z contiguous=True, 2025-05-07T20:33:13.0594645Z compiled=False, 2025-05-07T20:33:13.0594832Z ) 2025-05-07T20:33:13.0595136Z self = 2025-05-07T20:33:13.0595614Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:13.0595884Z 2025-05-07T20:33:13.0595956Z @given( 2025-05-07T20:33:13.0596174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0596473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0596768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0597130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0597443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0597716Z ) 2025-05-07T20:33:13.0598052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0598476Z def test_silu_mul_quant( 2025-05-07T20:33:13.0598713Z self, 2025-05-07T20:33:13.0598894Z T: int, 2025-05-07T20:33:13.0599081Z D: int, 2025-05-07T20:33:13.0599286Z scale_ub: Optional[float], 2025-05-07T20:33:13.0599540Z contiguous: bool, 2025-05-07T20:33:13.0599632Z compiled: bool, 2025-05-07T20:33:13.0599707Z ) -> None: 2025-05-07T20:33:13.0599797Z torch.manual_seed(2025) 2025-05-07T20:33:13.0599871Z 2025-05-07T20:33:13.0600033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0601877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
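As a sanity check on the numbers above: every failed request matches the [T, 2 * D] bfloat16 input tensor exactly, at 2 bytes per element. A quick verification in Python:

    # Size of the test's input x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def input_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # 2 bytes per bfloat16 element

    assert input_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
    assert input_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
    assert input_mib(2048, 7168) == 56.0    # the 56.00 MiB request further down

The GPU simply has no headroom left: with 21.73 GiB of the 22.07 GiB already held by PyTorch, even the smallest of these requests cannot be served.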
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f03a14107c0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
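This failure is architectural, not memory-related: fp8e4nv is Triton's FP8 E4M3 type, which compiles only on NVIDIA compute capability 8.9 or newer (Ada, Hopper), while the A10G in a linux.g5.4xlarge runner is sm_86 and exposes only fp8e4b15 and fp8e5. A hedged sketch of the kind of skip guard such a test could use; the marker name is illustrative, not FBGEMM's actual guard:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv (E4M3) requires sm_89+ (Ada) or sm_90+ (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that reach _fbgemm_silu_mul_quant's FP8 path.
    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )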
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92, tried to allocate 56.00 MiB (21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

This example takes the torch.compile path (torch/_dynamo/eval_frame.py:678: in _fn, return fn(*args, **kwargs)) into the same call chain, activation.py:80 -> _fbgemm_silu_mul_quant[grid] -> triton jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir, and ends in the identical error:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
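By this point even 20.00 MiB requests fail, and the failure site has moved from the initial torch.randn (line 92) to the later x_clamp allocation (line 95), which suggests tensors from earlier Hypothesis examples are still holding nearly the whole 22 GiB. A minimal sketch of explicitly releasing cached CUDA memory between examples; whether FBGEMM's harness already does something equivalent is not shown by this log:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to old tensors first
        torch.cuda.synchronize()  # let in-flight kernels finish
        torch.cuda.empty_cache()  # hand cached, unreferenced blocks back to the driver

    # e.g. call release_cuda_memory() at the top of test_silu_mul_quant, or from a
    # pytest fixture, before allocating the [T, 2 * D] bfloat16 input.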
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp), tried to allocate 20.00 MiB (4.44 MiB free, 3.87 MiB reserved but unallocated)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn), tried to allocate 20.00 MiB (4.44 MiB free, 3.87 MiB reserved but unallocated)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
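The three DeprecationWarning entries come from Triton's autotuner and are unrelated to the failure; until the autotune call sites stop passing warmup/rep/use_cuda_graph, they could be filtered. A sketch using the standard warnings module; the message regex copies the warning text above:

    import warnings

    warnings.filterwarnings(
        "ignore",
        message=r"warmup, rep, and use_cuda_graph parameters are deprecated",
        category=DeprecationWarning,
        module=r"triton\.runtime\.autotuner",
    )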
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 1 failed, 1 deselected, 3 warnings in 13.06s =================
2025-05-07T20:33:14.5652384Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:33:14.6271759Z [EXEC] [ATTEMPT 2/2] Command attempt failed.
2025-05-07T20:33:14.6272966Z [EXEC] The command has failed after 2 + 1 attempts; aborting.
2025-05-07T20:33:14.6273570Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py
2025-05-07T20:33:14.6289872Z ##[error]Process completed with exit code 1.
2025-05-07T20:33:14.6377085Z Post job cleanup.
2025-05-07T20:33:14.7366852Z [command]/usr/bin/git version
2025-05-07T20:33:14.7411652Z git version 2.47.1
2025-05-07T20:33:14.7449910Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/7cd6b761-893f-479d-9f5d-dbf6b58edd7b/.gitconfig'
2025-05-07T20:33:14.7461195Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/7cd6b761-893f-479d-9f5d-dbf6b58edd7b' before making global git config changes
2025-05-07T20:33:14.7462033Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:33:14.7466959Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:33:14.7511039Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:33:14.7545674Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:33:14.7879797Z Entering 'external/asmjit'
2025-05-07T20:33:14.7946769Z Entering 'external/composable_kernel'
2025-05-07T20:33:14.8021281Z Entering 'external/cpuinfo'
2025-05-07T20:33:14.8087678Z Entering 'external/cutlass'
2025-05-07T20:33:14.8161678Z Entering 'external/googletest'
2025-05-07T20:33:14.8228853Z Entering 'external/hipify_torch'
2025-05-07T20:33:14.8294875Z Entering 'external/json'
2025-05-07T20:33:14.8383489Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:33:14.8408818Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8420610Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:33:14.8452360Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:33:14.8779008Z Entering 'external/asmjit'
2025-05-07T20:33:14.8820675Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8865407Z Entering 'external/composable_kernel'
2025-05-07T20:33:14.8908953Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8959067Z Entering 'external/cpuinfo'
2025-05-07T20:33:14.9001980Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9044795Z Entering 'external/cutlass'
2025-05-07T20:33:14.9088725Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9140169Z Entering 'external/googletest'
2025-05-07T20:33:14.9182255Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9224508Z Entering 'external/hipify_torch'
2025-05-07T20:33:14.9266306Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9308836Z Entering 'external/json'
2025-05-07T20:33:14.9350141Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9503778Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:14.9534602Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:14.9544948Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:14.9545317Z ##[endgroup]
2025-05-07T20:33:14.9642161Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:25.7194798Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:42.1539827Z Cleaning up orphan processes
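For rerunning the suite with the allocator hint the OOM messages keep suggesting, PYTORCH_CUDA_ALLOC_CONF has to be in the environment before the first CUDA allocation. A minimal sketch for a local repro; exporting the variable in the shell before the `conda run python -m pytest ...` command above works equally well:

    import os

    # Set before torch initializes its CUDA caching allocator, i.e. before the
    # first device allocation such as torch.randn(..., device="cuda").
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  - imported after the allocator config on purpose

Note that expandable segments only mitigate fragmentation; they cannot help once earlier examples have genuinely consumed the card, so the cleanup sketch above remains the more likely fix.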